Sequence Labeling in Product Descriptions on Invoices: Comparing LLM-based settings with a CRF baseline

Authors

Darrazão, E., Oliveira, K., and Gomes-Jr, L. C.

DOI:

https://doi.org/10.5753/jbcs.2025.5743

Keywords:

Sequence Labeling, Named Entity Recognition, Invoices, Large Language Models, BERT, Conditional Random Fields, Brazilian Portuguese

Abstract

Electronic invoices are present in most commercial transactions, since several countries require them to be issued for the purchase, sale, and transportation of goods. Accurately identifying the elements within these invoices is crucial for governmental oversight, supporting tasks such as detecting overpricing in public contracts. However, this identification is challenging because of the diversity of products and the variations and errors in how the information is filled out. This article compares a model built with a traditional Conditional Random Fields (CRF) technique against models based on large language models adapted for the same task. The goal is to assess whether language models can improve performance in this scenario. The paper evaluates several modeling choices, including the language of the base model (Portuguese-specific vs. multilingual BERT) and alternatives for the classification head (fine-tuning with a linear layer vs. feature extraction with a BiLSTM and a linear layer, with or without a CRF layer). The best model, which combines a Portuguese BERT-based approach with a CRF layer, achieves an F1-score improvement of approximately 4% over the CRF-only baseline. The experiments used data from invoices issued in Brazil in 2021 in the context of public contracts.
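
As a rough illustration of what a CRF layer placed on top of a token encoder can contribute, the sketch below runs Viterbi decoding over per-token label scores plus label-to-label transition scores. It is not the authors' implementation: the label set, the numeric scores, and the function name `viterbi_decode` are all assumptions made for this example, and start and end transitions are left out for brevity.

```python
from typing import List

# Hypothetical BIO label set; the tag set used in the paper is not given here.
LABELS = ["O", "B-PRODUCT", "I-PRODUCT"]


def viterbi_decode(emissions: List[List[float]],
                   transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring sequence of label indices.

    emissions[t][j]   score of label j at token t (e.g. encoder + linear layer)
    transitions[i][j] score of moving from label i to label j
    """
    if not emissions:
        return []
    n_labels = len(emissions[0])
    score = list(emissions[0])      # best score of any path ending in each label
    backpointers = []               # best previous label for each step and label
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Follow the backpointers from the best final label to recover the path.
    path = [max(range(n_labels), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Made-up scores for a three-token description. Taking the best label per token
# independently would give O, I-PRODUCT, O, which is not a valid BIO sequence.
emissions = [
    [1.0, 0.9, 0.0],
    [0.0, 0.2, 1.0],
    [1.0, 0.0, 0.1],
]
# A strongly negative score discourages the invalid O -> I-PRODUCT transition.
transitions = [
    [0.0, 0.0, -1e4],  # from O
    [0.0, 0.0, 0.0],   # from B-PRODUCT
    [0.0, 0.0, 0.0],   # from I-PRODUCT
]

decoded = [LABELS[i] for i in viterbi_decode(emissions, transitions)]
print(decoded)  # ['B-PRODUCT', 'I-PRODUCT', 'O']
```

In this toy case the transition penalty shifts the first token from O to B-PRODUCT, so the decoded sequence is well formed even though the per-token scores alone favor an invalid one. In a trained model the transition scores would be learned jointly with the encoder rather than set by hand.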

Published

2025-10-24

How to Cite

Darrazão, E., Oliveira, K., & Gomes-Jr, L. C. (2025). Sequence Labeling in Product Descriptions on Invoices: Comparing LLM-based settings with a CRF baseline. Journal of the Brazilian Computer Society, 31(1), 1203–1212. https://doi.org/10.5753/jbcs.2025.5743

Section

Articles