Rating Prediction in Brazilian Portuguese: A Benchmark of Large Language Models

Authors

Marreira, E., de Melo, T., de Oliveira, M., & Figueiredo, C. M. S.

DOI:

https://doi.org/10.5753/jbcs.2025.5667

Keywords:

Rating Prediction, Large Language Models, Natural Language Processing

Abstract

This study evaluates the performance of Large Language Models (LLMs) in predicting ratings for Brazilian Portuguese user reviews. We benchmark ten LLMs, including ChatGPT-3.5, ChatGPT-4o, DeepSeek, Mistral, LLaMA (3, 3.3), Gemma (1, 2), and the Brazilian Portuguese-specific models Sabiá-3 and Sabiazinho, using two prompting strategies: simple (p1) and detailed (p2). Results indicate that ChatGPT-4o and DeepSeek achieved the highest accuracy, particularly in predicting extreme ratings (1 and 5 stars). Sabiá-3 also performed competitively, highlighting the potential of language-specific models. Models performed better in objective categories such as food and baby products but struggled with more subjective domains like automotive and games. Cost analysis showed that DeepSeek is a more cost-effective alternative to ChatGPT-4o while maintaining similar accuracy. This study provides a systematic benchmark of LLMs for rating prediction in Brazilian Portuguese, offering insights into their effectiveness and limitations.

References

Abonizio, H., Almeida, T. S., Laitz, T., Junior, R. M., Bonás, G. K., Nogueira, R., and Pires, R. (2024). Sabiá-3 technical report. arXiv preprint arXiv:2410.12049. DOI: 10.48550/arXiv.2410.12049.

Ahmed, B. H. and Ghabayen, A. S. (2022). Review rating prediction framework using deep learning. Journal of Ambient Intelligence and Humanized Computing, 13(7):3423-3432. DOI: 10.1007/s12652-020-01807-4.

Al Nazi, Z., Hossain, M. R., and Al Mamun, F. (2025). Evaluation of open and closed-source LLMs for low-resource language with zero-shot, few-shot, and chain-of-thought prompting. Natural Language Processing Journal, page 100124. DOI: 10.1016/j.nlp.2024.100124.

Asghar, N. (2016). Yelp dataset challenge: Review rating prediction. arXiv preprint arXiv:1605.05362. DOI: 10.48550/arXiv.1605.05362.

Barman, K. D., Bordoloi, B., Kumar, A., and Halder, A. (2024). Review rating predictions using improved deep learning architecture. In 2024 IEEE 16th International Conference on Computational Intelligence and Communication Networks (CICN), pages 468-472. IEEE. DOI: 10.1109/cicn63059.2024.10847509.

Chambua, J. and Niu, Z. (2021). Review text based rating prediction approaches: preference knowledge learning, representation and utilization. Artificial Intelligence Review, 54:1171-1200. DOI: 10.1007/s10462-020-09873-y.

de Araujo, G., de Melo, T., and Figueiredo, C. M. S. (2024). Is ChatGPT an effective solver of sentiment analysis tasks in Portuguese? A preliminary study. In Proceedings of the 16th International Conference on Computational Processing of Portuguese, pages 13-21.

de Melo, T., da Silva, A. S., de Moura, E. S., and Calado, P. (2019). Opinionlink: Leveraging user opinions for product catalog enrichment. Information Processing & Management, 56(3):823-843. DOI: 10.1016/j.ipm.2019.01.004.

DeepSeek-AI (2024). DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. (2024). The Llama 3 herd of models. arXiv preprint arXiv:2407.21783. DOI: 10.48550/arXiv.2407.21783.

Feng, W. and Yan, J. (2024). Language abstraction in negative online customer reviews: The choice of corporate response strategy and voice. SAGE Open, 14(2):21582440241240561. DOI: 10.1177/21582440241240561.

Hanić, S., Bagić Babac, M., Gledec, G., and Horvat, M. (2024). Comparing machine learning models for sentiment analysis and rating prediction of vegan and vegetarian restaurant reviews. Computers, 13(10):248. DOI: 10.3390/computers13100248.

Hossain, M. I., Rahman, M., Ahmed, M. T., Rahman, M. S., and Islam, A. T. (2021). Rating prediction of product reviews of Bangla language using machine learning algorithms. In 2021 International Conference on Artificial Intelligence and Mechatronics Systems (AIMS), pages 1-6. IEEE. DOI: 10.1109/aims52415.2021.9466022.

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. (2023). Mistral 7B. arXiv preprint arXiv:2310.06825. DOI: 10.48550/arXiv.2310.06825.

Kang, W.-C., Ni, J., Mehta, N., Sathiamoorthy, M., Hong, L., Chi, E., and Cheng, D. Z. (2023). Do LLMs understand user preferences? Evaluating LLMs on user rating prediction. arXiv preprint arXiv:2305.06474. DOI: 10.48550/arXiv.2305.06474.

Liu, B. and Zhang, L. (2012). A survey of opinion mining and sentiment analysis. In Mining text data, pages 415-463. Springer. DOI: 10.1007/978-1-4614-3223-4_13.

Liu, F., Liu, Y., Chen, H., Cheng, Z., Nie, L., and Kankanhalli, M. (2025). Understanding before recommendation: Semantic aspect-aware review exploitation via large language models. ACM Transactions on Information Systems, 43(2):1-26. DOI: 10.1145/3704999.

Pereira, D. A. (2021). A survey of sentiment analysis in the Portuguese language. Artificial Intelligence Review, 54(2):1087-1115. DOI: 10.1007/s10462-020-09870-1.

Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Sauvestre, R., Remez, T., et al. (2023). Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950. DOI: 10.48550/arXiv.2308.12950.

Sahoo, P., Singh, A. K., Saha, S., Jain, V., Mondal, S., and Chadha, A. (2024). A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv preprint arXiv:2402.07927. DOI: 10.48550/arXiv.2402.07927.

Team, G., Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., Rivière, M., Kale, M. S., Love, J., et al. (2024a). Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295. DOI: 10.48550/arXiv.2403.08295.

Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ramé, A., et al. (2024b). Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118. DOI: 10.48550/arXiv.2408.00118.

Wang, Q., Zhang, W., Li, J., Mai, F., and Ma, Z. (2022). Effect of online review sentiment on product sales: The moderating role of review credibility perception. Computers in Human Behavior, 133:107272. DOI: 10.1016/j.chb.2022.107272.

Ye, J., Chen, X., Xu, N., Zu, C., Shao, Z., Liu, S., Cui, Y., Zhou, Z., Gong, C., Shen, Y., et al. (2023). A comprehensive capability analysis of GPT-3 and GPT-3.5 series models. arXiv preprint arXiv:2303.10420. DOI: 10.48550/arXiv.2303.10420.

Zhang, X., Li, Y., Wang, J., Sun, B., Ma, W., Sun, P., and Zhang, M. (2024). Large language models as evaluators for recommendation explanations. In Proceedings of the 18th ACM Conference on Recommender Systems, pages 33-42. DOI: 10.1145/3640457.3688075.

Published

2025-10-03

How to Cite

Marreira, E., de Melo, T., de Oliveira, M., & Figueiredo, C. M. S. (2025). Rating Prediction in Brazilian Portuguese: A Benchmark of Large Language Models. Journal of the Brazilian Computer Society, 31(1), 828–839. https://doi.org/10.5753/jbcs.2025.5667

Section

Articles