HelBERT: A BERT-Based Pretraining Model for Public Procurement Tasks in Portuguese
DOI: https://doi.org/10.5753/jbcs.2026.5511

Keywords: BERT, NLP, Pretrain, Public procurement

Abstract
Deep learning models excel at a wide range of tasks but require extensive annotated data for supervised learning, and in NLP the scarcity of such data is a persistent bottleneck. Self-supervised pretraining addresses this limitation by training models on unlabeled text to learn useful representations, and domain-specific pretraining is crucial for strong performance on downstream tasks. Although pretrained BERT models exist for legal documents in several languages, none target public procurement documents in Portuguese, which contain terminology that existing models do not cover. In this paper, we propose HelBERT, a BERT-based model pretrained on a large corpus of Brazilian Portuguese public procurement documents, including laws, tender notices, and contracts. The experimental results demonstrate that HelBERT outperforms the other models in all analyses: it surpasses BERTimbau and JurisBERT on classification tasks, with F1-score improvements of 5% and 4%, respectively, and achieves gains exceeding 3% on semantic similarity tasks relative to the baseline models. Moreover, despite using a GPU with limited memory and processing resources, the proposed approach reaches superior results in fewer training epochs than the baselines. These findings underscore the effectiveness of the proposed model for NLP tasks in the public procurement domain.
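To make the intended usage concrete, the sketch below shows how a domain-pretrained BERT checkpoint of this kind could be probed for the two task families evaluated in the paper: masked-token prediction over procurement text and sentence-level semantic similarity. This is a minimal sketch, not the authors' evaluation code; it assumes the Hugging Face transformers and torch libraries, the checkpoint path HELBERT_CHECKPOINT is a hypothetical placeholder for wherever the released model would be hosted, and the example sentences are illustrative only.

import torch
from transformers import AutoModel, AutoTokenizer, pipeline

HELBERT_CHECKPOINT = "path/to/helbert"  # hypothetical placeholder, not an official model id

# Masked-language-model probe: a procurement-domain model should rank
# domain terminology highly for the masked position.
fill_mask = pipeline("fill-mask", model=HELBERT_CHECKPOINT)
print(fill_mask("O edital de [MASK] define os requisitos da contratação."))

# Semantic similarity: mean-pool the last hidden states into a sentence
# vector and compare two passages with cosine similarity.
tokenizer = AutoTokenizer.from_pretrained(HELBERT_CHECKPOINT)
model = AutoModel.from_pretrained(HELBERT_CHECKPOINT)

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)
    return hidden.mean(dim=1).squeeze(0)            # mean pooling over tokens

a = embed("Contrato de prestação de serviços de limpeza predial.")
b = embed("Termo de contrato para serviços de conservação e limpeza.")
print(torch.cosine_similarity(a, b, dim=0).item())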
References
Al-qurishi, M., Alqaseemi, S., and Souissi, R. (2022). AraLegal-BERT: A pretrained language model for Arabic legal text. In Aletras, N., Chalkidis, I., Barrett, L., Goanță, C., and Preoțiuc-Pietro, D., editors, Proceedings of the Natural Legal Language Processing Workshop 2022, pages 338-344. Association for Computational Linguistics. DOI: 10.18653/v1/2022.nllp-1.31.
Alamoudi, E. and Alghamdi, N. (2021). Sentiment classification and aspect-based sentiment analysis on Yelp reviews using deep learning and word embeddings. Journal of Decision Systems, 30:259-281. DOI: 10.1080/12460125.2020.1864106.
Alatawi, H., Alhothali, A., and Moria, K. (2021). Detecting white supremacist hate speech using domain-specific word embedding with deep learning and BERT. IEEE Access, 9:106363-106374. DOI: 10.1109/ACCESS.2021.3100435.
Alsentzer, E., Murphy, J., Boag, W., Weng, W.-H., Jindi, D., Naumann, T., and McDermott, M. (2019). Publicly available clinical BERT embeddings. In Rumshisky, A., Roberts, K., Bethard, S., and Naumann, T., editors, Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72-78. Association for Computational Linguistics. DOI: 10.18653/v1/W19-1909.
Argilla, S.L.U. (2024). Open-source data curation platform for LLMs. Available at: [link].
Baevski, A., Edunov, S., Liu, Y., Zettlemoyer, L., and Auli, M. (2019). Cloze-driven pre-training of self-attention networks. In Inui, K., Jiang, J., Ng, V., and Wan, X., editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5360-5369, Hong Kong, China. Association for Computational Linguistics. DOI: 10.18653/v1/D19-1539.
Bambroo, P. and Awasthi, A. (2021). LegalDB: Long DistilBERT for legal document classification. In 2021 International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT), pages 1-4. DOI: 10.1109/ICAECT49130.2021.9392558.
Beltagy, I., Lo, K., and Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. In Inui, K., Jiang, J., Ng, V., and Wan, X., editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615-3620, Hong Kong, China. Association for Computational Linguistics. DOI: 10.18653/v1/D19-1371.
Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., and Androutsopoulos, I. (2020). LEGAL-BERT: The muppets straight out of law school. In Cohn, T., He, Y., and Liu, Y., editors, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2898-2904, Online. Association for Computational Linguistics. DOI: 10.18653/v1/2020.findings-emnlp.261.
Condevaux, C. and Harispe, S. (2023). LSG attention: Extrapolation of pretrained transformers to long sequences. In Advances in Knowledge Discovery and Data Mining: 27th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2023, Osaka, Japan, May 25-28, 2023, Proceedings, Part I, pages 443-454, Berlin, Heidelberg. Springer-Verlag. DOI: 10.1007/978-3-031-33374-3_35.
Conneau, A. and Lample, G. (2019). Cross-lingual language model pre-training. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, page 11, Red Hook, NY, USA. Curran Associates Inc. DOI: 10.5555/3454287.3454921.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C., and Solorio, T., editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171-4186. Association for Computational Linguistics. DOI: 10.18653/v1/N19-1423.
Douka, S., Abdine, H., Vazirgiannis, M., El Hamdani, R., and Restrepo Amariles, D. (2021). JuriBERT: A masked-language model adaptation for French legal text. In Aletras, N., Androutsopoulos, I., Barrett, L., Goanță, C., and Preoțiuc-Pietro, D., editors, Proceedings of the Natural Legal Language Processing Workshop 2021, pages 95-101, Punta Cana, Dominican Republic. Association for Computational Linguistics. DOI: 10.18653/v1/2021.nllp-1.9.
Fesseha, A., Xiong, S., Emiru, E. D., Diallo, M., and Dahou, A. (2021). Text classification based on convolutional neural networks and word embedding for low-resource languages: Tigrinya. Information, 12(2). DOI: 10.3390/info12020052.
Firth, J. (1957). A Synopsis of Linguistic Theory, 1930-1955. Blackwell, Oxford. Book.
García-Díaz, J., Cánovas-García, M., Colomo-Palacios, R., and Valencia-García, R. (2021). Detecting misogyny in Spanish tweets. An approach based on linguistics features and word embeddings. Future Generation Computer Systems, 114:506-518. DOI: 10.1016/j.future.2020.08.032.
CCDV-AI (2024). ccdv-ai/convert_checkpoint_to_lsg: Efficient attention for long sequence processing. GitHub repository. Available at: [link].
Howard, J. and Ruder, S. (2018). Universal language model fine-tuning for text classification. In Gurevych, I. and Miyao, Y., editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328-339, Melbourne, Australia. Association for Computational Linguistics. DOI: 10.18653/v1/P18-1031.
Joshi, M., Chen, D., Liu, Y., Weld, D. S., Zettlemoyer, L., and Levy, O. (2020). SpanBERT: Improving Pre-training by Representing and Predicting Spans. Transactions of the Association for Computational Linguistics, 8:64-77. DOI: 10.1162/tacl_a_00300.
Khurana, D., Koli, A., Khatter, K., and Singh, S. (2023). Natural language processing: state of the art, current trends and challenges. Multimedia Tools and Applications, 82(3):3713-3744. DOI: 10.1007/s11042-022-13428-4.
Lawrence, S. and Giles, C. (2000). Overfitting and neural networks: Conjugate gradient and backpropagation. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000): Neural Computing: New Challenges and Perspectives for the New Millennium, volume 1, pages 114-119. DOI: 10.1109/IJCNN.2000.857823.
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., and Kang, J. (2019). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234-1240. DOI: 10.1093/bioinformatics/btz682.
Licari, D. and Comandè, G. (2022). Italian-Legal-BERT: A pretrained transformer language model for Italian law. In Symeonidou, D., Yu, R., Ceolin, D., Poveda-Villalón, M., Audrito, D., Caro, L. D., Grasso, F., Nai, R., Sulis, E., Ekaputra, F. J., Kutz, O., and Troquard, N., editors, Companion Proceedings of the 23rd International Conference on Knowledge Engineering and Knowledge Management, Bozen-Bolzano, Italy. CEUR. ISSN: 1613-0073. DOI: 10.1016/j.clsr.2023.105908.
Lima, W., Silva, V., Silva, J., Lira, R., and Paiva, A. (2025). BidCorpus: A multifaceted learning dataset for public procurement. Data in Brief, 58:111202. DOI: 10.1016/j.dib.2024.111202.
Liu, Z., Huang, D., Huang, K., Li, Z., and Zhao, J. (2020). FinBERT: A pre-trained financial language representation model for financial text mining. In Bessiere, C., editor, Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pages 4513-4519. International Joint Conferences on Artificial Intelligence Organization. Special Track on AI in FinTech. DOI: 10.24963/ijcai.2020/622.
Miaschi, A., Brunato, D., Dell'Orletta, F., and Venturi, G. (2021). What makes my model perplexed? A linguistic investigation on neural language models perplexity. In Agirre, E., Apidianaki, M., and Vulić, I., editors, Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 40-47, Online. Association for Computational Linguistics. DOI: 10.18653/v1/2021.deelio-1.5.
Silva, M. O., Oliveira, G. P., Hott, H., Gomide, L. D., Mendes, B. M. A., Bacha, C. A., Costa, L. L., Brandão, M. A., Lacerda, A., and Pappa, G. L. (2024). LiPSet: A comprehensive dataset of labeled Portuguese public bidding documents. Journal of Information and Data Management, 15(1):196-205. DOI: 10.5753/jidm.2024.3460.
Onan, A. (2021). Sentiment analysis on product reviews based on weighted word embeddings and deep neural networks. Concurrency and Computation: Practice and Experience, 33(23):e5909. DOI: 10.1002/cpe.5909.
Pappagari, R., Zelasko, P., Villalba, J., Carmiel, Y., and Dehak, N. (2019). Hierarchical transformers for long document classification. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 838-844. DOI: 10.1109/ASRU46091.2019.9003958.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. In Walker, M., Ji, H., and Stent, A., editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227-2237, New Orleans, Louisiana. Association for Computational Linguistics. DOI: 10.18653/v1/N18-1202.
Rodrigues, R. B. M., Privatto, P. I. M., de Sousa, G. J., Murari, R. P., Afonso, L. C. S., Papa, J. P., Pedronette, D. C. G., Guilherme, I. R., Perrout, S. R., and Riente, A. F. (2022). PetroBERT: A domain adaptation language model for oil and gas applications in Portuguese. In Pinheiro, V., Gamallo, P., Amaro, R., Scarton, C., Batista, F., Silva, D., Magro, C., and Pinto, H., editors, Computational Processing of the Portuguese Language, pages 101-109. Springer International Publishing. DOI: 10.1007/978-3-030-98305-5_10.
Rust, P., Pfeiffer, J., Vulić, I., Ruder, S., and Gurevych, I. (2021). How good is your tokenizer? On the monolingual performance of multilingual language models. In Zong, C., Xia, F., Li, W., and Navigli, R., editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3118-3135, Online. Association for Computational Linguistics. DOI: 10.18653/v1/2021.acl-long.243.
Sajjad, H., Dalvi, F., Durrani, N., and Nakov, P. (2023). On the effect of dropping layers of pre-trained transformer models. Computer Speech & Language, 77:101429. DOI: 10.1016/j.csl.2022.101429.
Salton, G., Wong, A., and Yang, C. S. (1975). A vector space model for automatic indexing. Commun. ACM, 18(11):613-620. DOI: 10.1145/361219.361220.
Santos, F. and Souza, K. (2024). Como combater a corrupção em licitações: detecção e prevenção de fraudes [How to combat corruption in public procurement: fraud detection and prevention]. Fórum, fourth edition. Book.
Schuster, M. and Nakajima, K. (2012). Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149-5152. DOI: 10.1109/ICASSP.2012.6289079.
Schütze, H. (1993). Word space. In Advances in Neural Information Processing Systems 5 (NIPS Conference), pages 895-902. DOI: 10.5555/645753.
Schütze, H. (1998). Automatic word sense discrimination. Computational Linguistics, 24(1):97-123. Available at: [link].
Silveira, R., Ponte, C., Almeida, V., Pinheiro, V., and Furtado, V. (2023). LegalBert-pt: A pretrained language model for the Brazilian Portuguese legal domain. In Naldi, M. C. and Bianchi, R. A. C., editors, Intelligent Systems, pages 268-282, Cham. Springer Nature Switzerland. DOI: 10.1007/978-3-031-45392-2_18.
Souza, F., Nogueira, R., and Lotufo, R. (2020). BERTimbau: Pretrained BERT models for Brazilian Portuguese. In Cerri, R. and Prati, R. C., editors, Intelligent Systems, pages 403-417, Cham. Springer International Publishing. DOI: 10.1007/978-3-030-61377-8_28.
Srinivasan, S., Ravi, V., Alazab, M., Ketha, S., Al-Zoubi, A., and Kotti Padannayil, S. (2021). Spam emails detection based on distributed word embedding with deep learning. In Machine Intelligence and Big Data Analytics for Cybersecurity Applications, pages 161-189. DOI: 10.1007/978-3-030-57024-8_7.
Stollenwerk, F. (2023). Training and evaluation of a multilingual tokenizer for GPT-SW3. DOI: 10.48550/arXiv.2304.14780.
Søgaard, A., Vulić, I., Ruder, S., and Faruqui, M. (2019). Cross-lingual word embeddings. Synthesis Lectures on Human Language Technologies. Springer Cham, 1 edition. DOI: 10.1007/978-3-031-02171-8.
Tagarelli, A. and Simeri, A. (2022). Unsupervised law article mining based on deep pretrained language representation models with application to the Italian civil code. Artificial Intelligence and Law, 30(3):417-473. DOI: 10.1007/s10506-021-09301-8.
Viegas, C., Costa, B., and Ishii, R. (2023). JurisBERT: A new approach that converts a classification corpus into an STS one. In Computational Science and Its Applications - ICCSA 2023, pages 349-365. DOI: 10.1007/978-3-031-36805-9_24.
Wu, T., Huang, Q., Liu, Z., Wang, Y., and Lin, D. (2020). Distribution-balanced loss for multi-label classification in long-tailed datasets. In Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part IV, pages 162-178, Berlin, Heidelberg. Springer-Verlag. DOI: 10.1007/978-3-030-58548-8_10.
Xiao, C., Hu, X., Liu, Z., Tu, C., and Sun, M. (2021). Lawformer: A pre-trained language model for Chinese legal long documents. AI Open, 2:79-84. DOI: 10.1016/j.aiopen.2021.06.003.
Xu, C., Bai, Y., Bian, J., Gao, B., Wang, G., Liu, X., and Liu, T.-Y. (2014). RC-NET: A general framework for incorporating knowledge into word representations. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 1219-1228, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/2661829.2662038.
Xu, C. and McAuley, J. (2023). A survey on model compression and acceleration for pretrained language models. Proceedings of the AAAI Conference on Artificial Intelligence, 37(9):10566-10575. DOI: 10.1609/aaai.v37i9.26255.
You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., and Hsieh, C.-J. (2020). Large batch optimization for deep learning: Training BERT in 76 minutes. In Proceedings of the Eighth International Conference on Learning Representations. DOI: 10.48550/arxiv.1904.00962.
Ács, J. (2024). Exploring BERT's vocabulary. Available at: [link].
License
Copyright (c) 2026 Weslley Emmanuel Martins Lima, Victor Ribeiro da Silva, Jasson Carvalho da Silva, Ricardo de Andrade Lira Rabêlo, Anselmo Cardoso de Paiva

This work is licensed under a Creative Commons Attribution 4.0 International License.

