Named Entity Recognition in Portuguese: Comparison of Pre-Trained Models with Fine-Tuning

Authors

G. Tapajós, T. de Melo, E. B. Guedes, F. Santos

DOI:

https://doi.org/10.5753/reic.2025.6077

Keywords:

Named Entity Recognition, NLP, Pre-trained Models, Fine-tuning

Abstract

Named Entity Recognition (NER) is a core task in Natural Language Processing (NLP), yet progress in Portuguese is still hindered by scarce resources. This study examines the fine-tuning of base, large, and distilled variants of BERT and RoBERTa models, in both multilingual and monolingual configurations, on the HAREM, LeNER-Br, and GeoCorpus datasets. XLM-RoBERTa-large achieved F1 scores of 83.8% on HAREM and 92.3% on LeNER-Br, while BERT-large-cased reached 87.8% on GeoCorpus, outperforming the baselines by up to five percentage points. Multilingual models showed greater adaptability, and distilled variants remained competitive at lower computational cost. The results show that fine-tuning large pre-trained models is an effective strategy for advancing NER in Portuguese.
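The pipeline the abstract describes is the standard token-classification fine-tuning recipe of the Transformers library (Wolf et al., 2020). The Python sketch below illustrates it on LeNER-Br; the Hub dataset id ("peluz/lener_br"), the hyperparameters, and the output path are illustrative assumptions, not the authors' reported configuration.

# Minimal fine-tuning sketch with Hugging Face Transformers (Wolf et al., 2020).
# Dataset id, hyperparameters, and output path are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

# LeNER-Br is hosted on the Hugging Face Hub; HAREM and GeoCorpus would need
# their own CoNLL-style readers.
dataset = load_dataset("peluz/lener_br")
label_names = dataset["train"].features["ner_tags"].feature.names

model_name = "xlm-roberta-large"  # the strongest model on HAREM and LeNER-Br
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(label_names))

def tokenize_and_align(batch):
    # Tokenize pre-split words; copy each word's BIO tag to its first
    # sub-token and mask the rest with -100 so the loss ignores them.
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    enc["labels"] = []
    for i, tags in enumerate(batch["ner_tags"]):
        prev, aligned = None, []
        for wid in enc.word_ids(batch_index=i):
            aligned.append(-100 if wid is None or wid == prev else tags[wid])
            prev = wid
        enc["labels"].append(aligned)
    return enc

tokenized = dataset.map(tokenize_and_align, batched=True,
                        remove_columns=dataset["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xlmr-lener-br", learning_rate=2e-5,
                           per_device_train_batch_size=8, num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()

Entity-level precision, recall, and F1, as reported in the abstract, would then be computed over the predicted BIO sequences, e.g. with the seqeval library.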


References

Albuquerque, H. O., Costa, R., Silvestre, G., Souza, E., da Silva, N. F. F., Vitório, D., Moriyama, G., Martins, L., Soezima, L., Nunes, A., Siqueira, F., Tarrega, J. P., Beinotti, J. V., Dias, M., Silva, M., Gardini, M., Silva, V., de Carvalho, A. C. P. L. F., and Oliveira, A. L. I. (2022). UlyssesNER-Br: A corpus of Brazilian legislative documents for named entity recognition. In Computational Processing of the Portuguese Language. Springer, Cham. DOI: 10.1007/978-3-030-98305-5_1.

Amaral, D. O. F. (2017). Reconhecimento de entidades nomeadas na área da Geologia: bacias sedimentares brasileiras [Named entity recognition in the field of Geology: Brazilian sedimentary basins]. PhD thesis, Pontifícia Universidade Católica do Rio Grande do Sul, Porto Alegre, Brazil. Available at: [link].

Beltagy, I., Lo, K., and Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620. DOI: 10.18653/v1/D19-1371.

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 8440–8451. Available at: [link].

da Silva, M. G. and de Oliveira, H. T. A. (2022). Combining word embeddings for Portuguese named entity recognition. In Computational Processing of the Portuguese Language, pages 198–208. Springer, Cham. DOI: 10.1007/978-3-030-98305-5_19.

de Almeida Neto, J. A. and de Melo, T. (2023). Exploring supervised learning models for multi-label text classification in Brazilian restaurant reviews. In Encontro Nacional de Inteligência Artificial e Computacional (ENIAC), pages 126–140. SBC. DOI: 10.5753/eniac.2023.233843.

de Araujo, P. H. L., de Campos, T. E., de Oliveira, R. R. R., Stauffer, M., Couto, S., and Bermejo, P. (2018). LeNER-Br: A dataset for named entity recognition in Brazilian legal text. In Computational Processing of the Portuguese Language, pages 313–323. Springer, Cham. DOI: 10.1007/978-3-319-99722-3_32.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, Volume 1, pages 4171–4186. DOI: 10.18653/v1/N19-1423.

Hedderich, M. A., Lange, L., Adel, H., Strötgen, J., and Klakow, D. (2021). A survey on recent approaches for natural language processing in low-resource scenarios. arXiv preprint. Available at: [link].

Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L., Wang, F., and Liu, Q. (2020). TinyBERT: Distilling BERT for natural language understanding. arXiv preprint. Available at: [link].

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint. DOI: 10.48550/arXiv.1907.11692.

Matos, E., Rodrigues, M., and Teixeira, A. (2024). Towards the automatic creation of NER systems for new domains. In 16th International Conference on Computational Processing of Portuguese Language (PROPOR 2024), pages 218–227. Available at: [link].

Pereira, D. A. (2021). A survey of sentiment analysis in the Portuguese language. Artificial Intelligence Review, 54(2):1087–1115. DOI: 10.1007/s10462-020-09870-1.

Pires, T., Schlinger, E., and Garrette, D. (2019). How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 4996–5001. Available at: [link].

Rodrigues, R. B. M., Privatto, P. I. M., de Sousa, G. J., Murari, R. P., Afonso, L. C. S., Papa, J. P., Pedronette, D. C. G., Guilherme, I. R., Perrout, S. R., and Riente, A. F. (2022). PetroBERT: A domain adaptation language model for oil and gas applications in Portuguese. In Computational Processing of the Portuguese Language, pages 101–109. Springer, Cham. DOI: 10.1007/978-3-030-98305-5_10.

Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS. Available at: [link].

Santos, D., Seco, N., Cardoso, N., and Vilela, R. (2006). HAREM: An advanced NER evaluation contest for Portuguese. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006). Available at: [link].

Santos, J., Vieira, R., Olival, F., Cameron, H., and Farrica, F. (2024). Named entity recognition specialised for Portuguese 18th-century history research. In PROPOR 2024, pages 117–126. Available at: [link].

Souza, F. C., Nogueira, R., and Lotufo, R. (2020). BERTimbau: Pretrained BERT models for Brazilian Portuguese. In Intelligent Systems, pages 403–417. Springer, Cham. DOI: 10.1007/978-3-030-61377-8_28.

Tjong Kim Sang, E. F. and De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147. Available at: [link].

Wang, Y., Tong, H., Zhu, Z., and Li, Y. (2022). Nested named entity recognition: A survey. ACM Transactions on Knowledge Discovery from Data. DOI: 10.1145/3522593.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. (2020). Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45. Available at: [link].

Zerbinati, M. M., Roman, N. T., and Di Felippo, A. (2024). A corpus of stock market tweets annotated with named entities. In PROPOR 2024, pages 276–284. Available at: [link].

Zilio, L., Lazzari, R. R., and Finatto, M. J. B. (2024). NLP for historical Portuguese: Analysing 18th-century medical texts. In PROPOR 2024, pages 76–85. Available at: [link].

Published

2025-07-11

How to Cite

Tapajós, G., de Melo, T., Guedes, E. B., & Santos, F. (2025). Named Entity Recognition in Portuguese: Comparison of Pre-Trained Models with Fine-Tuning. Electronic Journal of Undergraduate Research on Computing, 23(1), 111–117. https://doi.org/10.5753/reic.2025.6077

Issue

Vol. 23 No. 1 (2025)

Section

Full Papers