Evaluating Preprocessing and Textual Representation on Brazilian Public Bidding Document Classification
DOI:
https://doi.org/10.5753/jidm.2025.4344
Keywords:
Public Bids, Digital Government, Document Classification, Text Preprocessing, Text Representation
Abstract
In this paper, we tackle the task of classifying public bidding documents, which is important for both public and private entities seeking precise insights into bidding processes. Our study evaluates the impact of various preprocessing techniques and textual representation models, particularly word embeddings, on the accuracy of document classification. Overall, our results reveal that, while preprocessing techniques have minimal influence on classification outcomes, the choice of textual representation model significantly affects the representativeness of document classes. Moreover, we perform a qualitative analysis of misclassification cases, providing insights into potential areas for improvement in document classification methodologies. Our findings underscore the importance of selecting appropriate textual representation models to enhance the accuracy and efficiency of document classification systems.

