An Empirical Analysis of Large Language Models for Automated Cross-Prompt Essay Trait Scoring in Brazilian Portuguese

Authors

A. Barbosa, I. C. Silveira, and D. D. Mauá

DOI:

https://doi.org/10.5753/jbcs.2025.5817

Keywords:

Automatic Essay Scoring, Large Language Models, Natural Language Processing

Abstract

The development of automated essay grading systems with minimal human intervention has been pursued for decades. While these systems have advanced significantly for English, in-depth analyses of modern Large Language Models for automatic essay scoring in Portuguese are still lacking. This work addresses that gap by evaluating different language model architectures (encoder-only, decoder-only, and reasoning-based) together with fine-tuning and prompt-engineering strategies. Our study focuses on scoring argumentative essays written as practice exercises for the Brazilian national university entrance exam (ENEM) according to five trait-specific criteria. Our results show that no single architecture is uniformly dominant, and that encoder-only models offer a good balance between accuracy and computational cost. We achieve state-of-the-art results on the dataset, with trait-specific performance ranging from .60 to .73 as measured by Quadratic Weighted Kappa.
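
For context (this example is not part of the paper), Quadratic Weighted Kappa (QWK) is an agreement measure between two raters that penalizes larger score disagreements quadratically. The minimal sketch below assumes scikit-learn is available and uses made-up scores on the 0-200 ENEM trait scale to illustrate how such a value can be computed.

```python
# Illustrative only: computing Quadratic Weighted Kappa (QWK) between
# human and model trait scores. QWK = 1 - sum(w_ij * O_ij) / sum(w_ij * E_ij),
# where w_ij grows quadratically with the distance between score categories.
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores for one ENEM trait (0-200 in steps of 40); not real data.
human_scores = [120, 160, 80, 200, 120, 160]
model_scores = [120, 120, 80, 160, 160, 160]

qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK: {qwk:.2f}")
```

A QWK of 1 indicates perfect agreement and 0 indicates agreement no better than chance, so the .60-.73 range reported above reflects agreement well above chance with the human graders.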

Published

2025-10-06

How to Cite

Barbosa, A., Silveira, I. C., & Mauá, D. D. (2025). An Empirical Analysis of Large Language Models for Automated Cross-Prompt Essay Trait Scoring in Brazilian Portuguese. Journal of the Brazilian Computer Society, 31(1), 858–871. https://doi.org/10.5753/jbcs.2025.5817

Issue

Vol. 31 No. 1 (2025)

Section

Articles