Named Entity Recognition in Portuguese: A Comparison of Pre-Trained Models with Fine-Tuning

Guilherme Tapajós; Tiago de Melo; Elloá B. Guedes; Fábio Santos

doi:10.5753/reic.2025.6077

Authors

Guilherme Tapajós Universidade do Estado do Amazonas https://orcid.org/0009-0002-0547-9404
Tiago de Melo Universidade do Estado do Amazonas https://orcid.org/0000-0002-3158-2299
Elloá B. Guedes Universidade do Estado do Amazonas https://orcid.org/0000-0002-7264-701X
Fábio Santos Universidade do Estado do Amazonas https://orcid.org/0000-0003-1067-8571


Keywords:

Named Entity Recognition, NLP, Pre-trained Models, Fine-tuning

Abstract

Named Entity Recognition (NER) is a central task in Natural Language Processing (NLP), but for Portuguese it remains limited by the scarcity of resources. This study analyzes the fine-tuning of base, large, and distilled variants of the BERT and RoBERTa models, in multilingual and monolingual configurations, using the HAREM, LeNER-Br, and GeoCorpus datasets. XLM-RoBERTa-large obtained F1-scores of 83.8% on HAREM and 92.3% on LeNER-Br, while BERT-large-cased reached 87.8% on GeoCorpus, surpassing the baselines by up to five percentage points. Multilingual models showed better adaptability, and the distilled versions maintained competitive performance at a lower computational cost. The results show that fine-tuning large pre-trained models is an effective strategy for advancing NER in Portuguese.
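The abstract reports results as F1-scores but does not say how they were computed. As an illustration only, the sketch below assumes CoNLL-style entity-level scoring over BIO tags, where a predicted entity counts as correct only if its type and boundaries match the gold annotation exactly. The function names `extract_spans` and `entity_f1` and the tag labels `PER` and `LOC` are hypothetical and do not come from the paper.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive)
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence.

    Assumption: a stray I- tag with no matching open span starts a new
    span (lenient handling); stricter schemes would discard it.
    """
    spans: Set[Span] = set()
    start, label = None, None
    # The trailing "O" sentinel closes a span that runs to the sentence end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged entity-level F1 with exact span and type matching."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = extract_spans(g_tags), extract_spans(p_tags)
        tp += len(g & p)       # spans present in both gold and prediction
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# One sentence with two gold entities; the prediction finds only the first.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
score = entity_f1(gold, pred)  # precision 1.0, recall 0.5
```

Under this scheme a partially matched entity earns no credit, so scores are typically lower than token-level accuracy on the same predictions.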


