BRoverbs - Measuring how much LLMs understand Portuguese proverbs

Authors

T. S. Almeida, G. K. Bonás, and J. G. A. Santos

DOI:

https://doi.org/10.5753/jbcs.2025.5797

Keywords:

Large Language Models, LLM Benchmark, Portuguese LLM Evaluation

Abstract

Large Language Models (LLMs) exhibit significant performance variations depending on the linguistic and cultural context in which they are applied. This disparity signals the need for mature evaluation frameworks that can assess their capabilities in specific regional settings. In the case of Portuguese, existing evaluations remain limited, often relying on translated datasets that may not fully capture linguistic nuances or cultural references. Meanwhile, native Portuguese-language datasets predominantly focus on structured national exams or sentiment analysis of social media interactions, leaving gaps in the evaluation of broader linguistic understanding. To address this limitation, we introduce BRoverbs, a dataset specifically designed to assess LLM performance through Brazilian proverbs. Proverbs are a rich linguistic resource, encapsulating cultural wisdom, figurative expressions, and complex syntactic structures that challenge a model's comprehension of regional expressions. BRoverbs provides a new evaluation tool for Portuguese-language LLMs, contributing to the advancement of regionally informed benchmarking. The benchmark is available at https://huggingface.co/datasets/Tropic-AI/BRoverbs.
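
For readers who want to inspect the benchmark, the sketch below shows one way it could be loaded with the Hugging Face datasets library. This is not part of the paper: the only assumption taken from the abstract is the dataset identifier in the URL above, while split names and column names are left unspecified here, so the code simply prints whatever structure it finds.

    # Minimal sketch: load BRoverbs from the Hugging Face Hub and inspect its structure.
    # Assumes only that the dataset at "Tropic-AI/BRoverbs" is publicly loadable.
    from datasets import load_dataset

    dataset = load_dataset("Tropic-AI/BRoverbs")  # returns a DatasetDict of splits

    # Print each available split, its size, its column names, and one example,
    # without assuming any particular schema.
    for split_name, split in dataset.items():
        print(f"Split: {split_name} ({len(split)} examples)")
        print(f"Columns: {split.column_names}")
        print(split[0])  # first example, whatever fields it contains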

Author Biography

Giovana Kerche Bonás, Institute of Computing, University of Campinas; Maritaca AI

Master's student at the Institute of Computing (IC), UNICAMP.

References

Abonizio, H., Almeida, T. S., Laitz, T., Junior, R. M., Bonás, G. K., Nogueira, R., and Pires, R. (2024). Sabiá-3 technical report. arXiv preprint arXiv:2410.12049. DOI: 10.48550/arXiv.2410.12049.

Alfina, I., Mulia, R., Fanany, M. I., and Ekanata, Y. (2017). Hate speech detection in the indonesian language: A dataset and preliminary study. In 2017 international conference on advanced computer science and information systems (ICACSIS), pages 233-238. IEEE. DOI: 10.1109/icacsis.2017.8355039.

Almeida, T. S., Abonizio, H., Nogueira, R., and Pires, R. (2024). Sabiá-2: A new generation of portuguese large language models. arXiv preprint arXiv:2403.09887. DOI: 10.48550/arXiv.2403.09887.

Almeida, T. S., Bonás, G. K., Santos, J. G. A., Abonizio, H., and Nogueira, R. (2025a). Tiebe: A benchmark for assessing the current knowledge of large language models. arXiv preprint arXiv:2501.07482. DOI: 10.48550/arXiv.2501.07482.

Almeida, T. S., Laitz, T., Bonás, G. K., and Nogueira, R. (2023). Bluex: A benchmark based on brazilian leading universities entrance exams. In Brazilian Conference on Intelligent Systems, pages 337-347. Springer. DOI: 10.1007/978-3-031-45368-7_22.

Almeida, T. S., Nogueira, R., and Pedrini, H. (2025b). Building high-quality datasets for portuguese llms: From common crawl snapshots to industrial-grade corpora. To Appear. DOI: 10.48550/arXiv.2509.08824.

Anthropic (2024a). Introducing claude 3.5 haiku. Available online [link].

Anthropic (2024b). Introducing claude 3.5 sonnet. Available online [link].

Azime, I. A., Tonja, A. L., Belay, T. D., Chanie, Y., Balcha, B. F., Abadi, N. H., Ademtew, H. B., Nerea, M. A., Yadeta, D. D., Geremew, D. D., et al. (2024). Proverbeval: Exploring llm evaluation challenges for low-resource language understanding. arXiv preprint arXiv:2411.05049. DOI: 10.18653/v1/2025.findings-naacl.350.

Baucells, I., Aula-Blasco, J., de Dios-Flores, I., Suárez, S. P., Pérez, N., Salles, A., Docio, S. S., Falcão, J., Saiz, J. J., Sepúlveda-Torres, R., et al. (2025). Iberobench: A benchmark for llm evaluation in iberian languages. In Proceedings of the 31st International Conference on Computational Linguistics, pages 10491-10519. Available at: [link].

Brum, H. B. and Nunes, M. d. G. V. (2017). Building a sentiment corpus of tweets in brazilian portuguese. arXiv preprint arXiv:1712.08917. DOI: 10.48550/arxiv.1712.08917.

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. DOI: 10.48550/arxiv.2107.03374.

Corrêa, N. K., Sen, A., Falk, S., and Fatimah, S. (2024). Tucano: Advancing neural text generation for portuguese. arXiv preprint arXiv:2411.07854. DOI: 10.1016/j.patter.2025.101325.

Delfino, P., Cuconato, B., Haeusler, E. H., and Rademaker, A. (2017). Passing the brazilian oab exam: data preparation and some experiments. In Legal knowledge and information systems, pages 89-94. IOS Press. DOI: 10.3233/978-1-61499-838-9-89.

Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. (2019). Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161. DOI: 10.48550/arxiv.1903.00161.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783. DOI: 10.48550/arXiv.2407.21783.

Fortuna, P., da Silva, J. R., Wanner, L., Nunes, S., et al. (2019). A hierarchically-labeled portuguese hate speech dataset. In Proceedings of the third workshop on abusive language online, pages 94-104. DOI: 10.18653/v1/w19-3510.

Giagkou, M., Lynn, T., Dunne, J., Piperidis, S., and Rehm, G. (2023). European language technology in 2022/2023. In European Language Equality: A Strategic Agenda for Digital Language Equality, pages 75-94. Springer. DOI: 10.1007/978-3-031-28819-7_4.

Hasan, T., Bhattacharjee, A., Islam, M. S., Samin, K., Li, Y.-F., Kang, Y.-B., Rahman, M. S., and Shahriyar, R. (2021). Xl-sum: Large-scale multilingual abstractive summarization for 44 languages. arXiv preprint arXiv:2106.13822. DOI: 10.18653/v1/2021.findings-acl.413.

Ho, X., Nguyen, A.-K. D., Sugawara, S., and Aizawa, A. (2020). Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. arXiv preprint arXiv:2011.01060. DOI: 10.18653/v1/2020.coling-main.580.

Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al. (2024). Gpt-4o system card. arXiv preprint arXiv:2410.21276. DOI: 10.48550/arxiv.2410.21276.

Jiang, Z., Anastasopoulos, A., Araki, J., Ding, H., and Neubig, G. (2020). X-factr: Multilingual factual knowledge retrieval from pretrained language models. arXiv preprint arXiv:2010.06189. DOI: 10.18653/v1/2020.emnlp-main.479.

Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., et al. (2019). Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453-466. DOI: 10.1162/tacl_a_00276.

Lai, G., Xie, Q., Liu, H., Yang, Y., and Hovy, E. (2017). Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683. DOI: 10.18653/v1/d17-1082.

Larcher, C., Piau, M., Finardi, P., Gengo, P., Esposito, P., and Caridá, V. (2023). Cabrita: closing the gap for foreign languages. arXiv preprint arXiv:2308.11878. DOI: 10.48550/arxiv.2308.11878.

Li, Y., Wang, S., Ding, H., and Chen, H. (2023). Large language models in finance: A survey. In Proceedings of the fourth ACM international conference on AI in finance, pages 374-382. DOI: 10.1145/3604237.3626869.

Liang, W., Zhang, Y., Wu, Z., Lepp, H., Ji, W., Zhao, X., Cao, H., Liu, S., He, S., Huang, Z., et al. (2024). Mapping the increasing use of llms in scientific papers. arXiv preprint arXiv:2404.01268. DOI: 10.48550/arxiv.2404.01268.

Longpre, S., Singh, N., Cherep, M., Tiwary, K., Materzynska, J., Brannon, W., Mahari, R., Dey, M., Hamdy, M., Saxena, N., et al. (2024). Bridging the data provenance gap across text, speech and video. arXiv preprint arXiv:2412.17847. DOI: 10.48550/arXiv.2412.17847.

Mi, M., Villavicencio, A., and Moosavi, N. S. (2024). Rolling the dice on idiomaticity: How llms fail to grasp context. arXiv preprint arXiv:2410.16069. DOI: 10.18653/v1/2025.acl-long.362.

Moayeri, M., Tabassi, E., and Feizi, S. (2024). Worldbench: Quantifying geographic disparities in llm factual recall. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 1211-1228. DOI: 10.1145/3630106.3658967.

Myung, J., Lee, N., Zhou, Y., Jin, J., Putri, R., Antypas, D., Borkakoty, H., Kim, E., Perez-Almendros, C., Ayele, A. A., et al. (2024). Blend: A benchmark for llms on everyday knowledge in diverse cultures and languages. Advances in Neural Information Processing Systems, 37:78104-78146. DOI: 10.48550/arxiv.2406.09948.

Naveed, H., Khan, A. U., Qiu, S., Saqib, M., Anwar, S., Usman, M., Akhtar, N., Barnes, N., and Mian, A. (2023). A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435. DOI: 10.1145/3744746.

Overwijk, A., Xiong, C., and Callan, J. (2022). Clueweb22: 10 billion web documents with rich information. In Proceedings of the 45th international ACM SIGIR conference on research and development in information retrieval, pages 3360-3362. DOI: 10.1145/3477495.3536321.

Parrish, A., Chen, A., Nangia, N., Padmakumar, V., Phang, J., Thompson, J., Htut, P. M., and Bowman, S. R. (2021). Bbq: A hand-built bias benchmark for question answering. arXiv preprint arXiv:2110.08193. DOI: 10.18653/v1/2022.findings-acl.165.

Pires, R., Abonizio, H., Almeida, T. S., and Nogueira, R. (2023). Sabiá: Portuguese large language models. In Brazilian Conference on Intelligent Systems, pages 226-240. Springer. DOI: 10.1007/978-3-031-45392-2_15.

Potts, C., Wu, Z., Geiger, A., and Kiela, D. (2020). Dynasent: A dynamic benchmark for sentiment analysis. arXiv preprint arXiv:2012.15349. DOI: 10.18653/v1/2021.acl-long.186.

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. DOI: 10.18653/v1/d16-1264.

Rane, N. L., Tawde, A., Choudhary, S. P., and Rane, J. (2023). Contribution and performance of chatgpt and other large language models (llm) for scientific and research advancements: a double-edged sword. International Research Journal of Modernization in Engineering Technology and Science, 5(10):875-899. DOI: 10.56726/irjmets45213.

Rudinger, R., Naradowsky, J., Leonard, B., and Van Durme, B. (2018). Gender bias in coreference resolution. arXiv preprint arXiv:1804.09301. DOI: 10.18653/v1/n18-2002.

Sayama, H. F., Araujo, A. V., and Fernandes, E. R. (2019). Faquad: Reading comprehension dataset in the domain of brazilian higher education. In 2019 8th Brazilian conference on intelligent systems (BRACIS), pages 443-448. IEEE. DOI: 10.1109/bracis.2019.00084.

Silveira, I. C. and Mauá, D. D. (2017). University entrance exam as a guiding test for artificial intelligence. In 2017 Brazilian Conference on Intelligent Systems (BRACIS), pages 426-431. IEEE. DOI: 10.1109/bracis.2017.44.

Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. (2022). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615. DOI: 10.48550/arxiv.2206.04615.

Thorne, J., Vlachos, A., Christodoulopoulos, C., and Mittal, A. (2018). Fever: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355. DOI: 10.18653/v1/n18-1074.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023a). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971. DOI: 10.48550/arxiv.2302.13971.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023b). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. DOI: 10.48550/arxiv.2307.09288.

Vargas, F. A., Carvalho, I., de Góes, F. R., Benevenuto, F., and Pardo, T. A. S. (2021). Hatebr: A large expert annotated corpus of brazilian instagram comments for offensive language and hate speech detection. arXiv preprint arXiv:2103.14972. DOI: 10.48550/arxiv.2103.14972.

Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. (2019). Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32. DOI: 10.48550/arxiv.1905.00537.

Watts, I., Gumma, V., Yadavalli, A., Seshadri, V., Swaminathan, M., and Sitaram, S. (2024). Pariksha: A large-scale investigation of human-llm evaluator agreement on multilingual and multi-cultural data. arXiv preprint arXiv:2406.15053. DOI: 10.18653/v1/2024.emnlp-main.451.

Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al. (2022). Emergent abilities of large language models. arXiv preprint arXiv:2206.07682. DOI: 10.48550/arxiv.2206.07682.

Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. (2024). Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. DOI: 10.48550/arXiv.2412.15115.

Yu, W., Jiang, Z., Dong, Y., and Feng, J. (2020). Reclor: A reading comprehension dataset requiring logical reasoning. arXiv preprint arXiv:2002.04326. DOI: 10.48550/arxiv.2002.04326.

Zhang, P., Zeng, G., Wang, T., and Lu, W. (2024). Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385. DOI: 10.48550/arxiv.2401.02385.

Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. (2023). A survey of large language models. arXiv preprint arXiv:2303.18223, 1(2). DOI: 10.48550/arxiv.2303.18223.

Published

2025-10-17

How to Cite

Almeida, T. S., Bonás, G. K., & Santos, J. G. A. (2025). BRoverbs - Measuring how much LLMs understand Portuguese proverbs. Journal of the Brazilian Computer Society, 31(1), 1078–1088. https://doi.org/10.5753/jbcs.2025.5797

Issue

Section

Articles