Assessing Data Quality Inconsistencies in Brazilian Governmental Data
DOI:
https://doi.org/10.5753/jidm.2023.3220
Keywords:
data quality, governmental data, great expectations, public bids, public expenditure
Abstract
In recent years, vast volumes of data have been made available on the Web and are increasingly used to support decisions in different contexts. However, for these decisions to be accurate and reliable, data quality must be ensured. Although the area has several definitions, there is consensus that data quality is always tied to a specific context. This work analyzes data quality in a data warehouse holding governmental information from the Brazilian state of Minas Gerais. We first present a brief comparison of eight open-source data quality tools and then choose the Great Expectations tool to analyze such data in two real applications: public bids and public expenditure. Our analyses show that the chosen tool has the characteristics needed to generate good data quality indicators and to reveal data quality issues that may directly impact the construction of final applications that use such data.