Exploring Brazil's LLM Fauna: Investigating the Generative Performance of Large Language Models in Portuguese

Authors

G. Assis, C. Freitas, and A. Paes

DOI:

https://doi.org/10.5753/jbcs.2025.5814

Keywords:

LLMs, NLG-Evaluation, Question-Answering, Summarization, Simplification, Brazilian Portuguese

Abstract

Large Language Models (LLMs) are now embedded in widely used applications worldwide, yet their evaluation still centers on narrow, discriminative benchmarks. Such evaluation pipelines often overlook key generative aspects, such as discourse coherence, linguistic transformations, and adequacy, which are crucial for real-world applications. In addition, most large-scale evaluations remain heavily biased toward English, limiting our understanding of LLM performance in other languages. This research addresses these gaps by presenting a comprehensive analysis of Brazilian Portuguese LLMs across three core Natural Language Generation tasks: summarization, simplification, and generative question answering. We evaluate six Brazilian models and compare them to the widely used GPT-4o. Our findings, supported by diverse automatic metrics, an LLM-as-a-judge framework, and human evaluation, show that the GPT-4o series achieves the best generative performance in Portuguese, followed closely by the Sabiá-3 family. Although slightly behind, the open-weight model Tucano stands out for its computational efficiency, making it a strong candidate for deployment in resource-constrained settings. The code used to conduct all experiments is publicly available at https://github.com/MeLLL-UFF/brfauna-gen-eval.

Published

2025-10-08

How to Cite

Assis, G., Freitas, C., & Paes, A. (2025). Exploring Brazil’s LLM Fauna: Investigating the Generative Performance of Large Language Models in Portuguese. Journal of the Brazilian Computer Society, 31(1), 940–972. https://doi.org/10.5753/jbcs.2025.5814

Issue

Vol. 31 No. 1 (2025)

Section

Articles