LiPSet: A Comprehensive Dataset of Labeled Portuguese Public Bidding Documents
DOI:
https://doi.org/10.5753/jidm.2024.3460
Keywords:
dataset, document classification, data labeling, open government data, public bidding
Abstract
Collecting, processing, and organizing governmental public documents pose significant challenges due to their diverse sources and formats, complicating data analysis. In this context, this work introduces LiPSet, a comprehensive dataset of labeled documents from Brazilian public bidding processes in Minas Gerais state. We provide an overview of the data collection process and present a methodology for data labeling that includes a meta-classifier to assist in the manual labeling process. Next, we perform an exploratory data analysis to summarize the key features and contributions of the LiPSet dataset. We also showcase a practical application of LiPSet by employing it as input data for classifying bidding documents. The results of the classification task exhibit promising performance, demonstrating the potential of LiPSet for training neural network models. Finally, we discuss various applications of LiPSet and highlight the primary challenges associated with its utilization.