Building High-Quality Datasets for Portuguese LLMs: From Common Crawl Snapshots to Industrial-Grade Corpora
DOI: https://doi.org/10.5753/jbcs.2025.5788

Keywords: Large Language Models, Large Pretraining Corpus, Multilingual Models

Abstract
The performance of large language models (LLMs) is deeply influenced by the quality and composition of their training data. While much of the existing work has centered on English, there remains a gap in understanding how to construct effective training corpora for other languages. We explore scalable methods for building web-based corpora for LLMs and apply them to build a new 120B-token Portuguese corpus that achieves results competitive with an industrial-grade corpus. Using a continual pretraining setup, we study how different data selection and preprocessing strategies affect LLM performance when transitioning a model originally trained in English to another language. Our findings demonstrate the value of language-specific filtering pipelines, including classifiers for educational content, science, technology, engineering, and mathematics (STEM), and toxic content. We show that adapting a model to the target language leads to performance improvements, reinforcing the importance of high-quality, language-specific data. While our case study focuses on Portuguese, our methods are applicable to other languages, offering insights for multilingual LLM development.
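
To make the filtering idea concrete, the sketch below illustrates the kind of language-specific pipeline the abstract describes: a fastText language-identification gate (assuming the publicly available lid.176.bin model), a few heuristic quality rules, and a placeholder score standing in for the learned educational/STEM and toxicity classifiers, which are not reproduced here. Thresholds, stopword list, and rules are illustrative assumptions, not the paper's exact configuration.

# Minimal sketch of a language-specific web-filtering pipeline (illustrative thresholds).
# Assumes fastText's public lid.176.bin language-ID model; classifier_score() is a
# placeholder -- the paper's trained STEM/toxicity classifiers are not shown here.
import fasttext

LID_MODEL = fasttext.load_model("lid.176.bin")  # download separately from fasttext.cc

PT_STOPWORDS = {"de", "que", "e", "o", "a", "do", "da", "em", "um", "para", "com", "não"}

def looks_portuguese(text: str, threshold: float = 0.65) -> bool:
    """Keep documents confidently identified as Portuguese."""
    labels, probs = LID_MODEL.predict(text.replace("\n", " "))
    return labels[0] == "__label__pt" and float(probs[0]) >= threshold

def passes_heuristics(text: str) -> bool:
    """Cheap quality gates in the spirit of Gopher/FineWeb-style rules (hypothetical values)."""
    words = text.split()
    if not (50 <= len(words) <= 100_000):
        return False
    stopword_ratio = sum(w.lower() in PT_STOPWORDS for w in words) / len(words)
    return stopword_ratio >= 0.05  # natural prose contains common function words

def classifier_score(text: str) -> float:
    """Placeholder for learned filters (educational/STEM value, toxicity)."""
    return 1.0  # hypothetical: plug a trained classifier in here

def keep_document(text: str, min_score: float = 0.5) -> bool:
    return looks_portuguese(text) and passes_heuristics(text) and classifier_score(text) >= min_score

In practice such gates are applied in order of cost (language ID and heuristics before any learned classifier), so that expensive model inference only runs on documents that survive the cheap filters.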
License
Copyright (c) 2025 Thales Sales Almeida, Rodrigo Nogueira, Helio Pedrini

This work is licensed under a Creative Commons Attribution 4.0 International License.

