Enhancing the Performance of Machine Learning Classifiers through Data Cleaning with Ensemble Confident Learning

Authors

  • Renato Okabayashi Miyaji, Escola Politécnica da Universidade de São Paulo
  • Felipe Valencia de Almeida, Escola Politécnica da Universidade de São Paulo
  • Pedro Luiz Pizzigatti Corrêa, Escola Politécnica da Universidade de São Paulo

DOI:

https://doi.org/10.5753/jidm.2025.4230

Keywords:

Data Cleaning, Machine Learning, Confident Learning, Species Distribution Modeling

Abstract

Model-centric techniques, such as hyperparameter optimization and regularization, are commonly used in the literature to improve the performance of machine learning classifiers. When the data are noisy, however, data-centric approaches show promising potential. This paper proposes a new method, Ensemble Confident Learning (ECL), which extends the Confident Learning technique with multiple learners to improve the selection of instances with biased labels. The method was applied to a case study of Species Distribution Modeling in the Amazon, in which classifiers estimate the probability of species occurrence from environmental conditions. Compared to Confident Learning, ECL improved Recall by 20% and ROC-AUC by 3.5% for Logistic Regression.
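
The idea of combining several learners to select suspicious labels can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the learners are combined, so the majority vote below (`min_votes`) is an assumption, as are all function names and the toy probabilities. Each learner applies a simplified form of the Confident Learning thresholding rule (per-class threshold equal to the mean predicted probability of that class among examples carrying that label), and an instance is flagged only when enough learners agree. In practice the probabilities would be out-of-sample predictions from trained classifiers; here they are hardcoded.

```python
from typing import List, Sequence


def class_thresholds(labels: Sequence[int],
                     probs: Sequence[Sequence[float]],
                     n_classes: int) -> List[float]:
    """Threshold for class j: mean predicted P(j) over examples labeled j."""
    thresholds = []
    for j in range(n_classes):
        scores = [p[j] for y, p in zip(labels, probs) if y == j]
        # A class with no labeled examples gets an unreachable threshold.
        thresholds.append(sum(scores) / len(scores) if scores else 1.0)
    return thresholds


def confident_issues(labels: Sequence[int],
                     probs: Sequence[Sequence[float]],
                     n_classes: int) -> List[bool]:
    """Flag examples whose confidently predicted class differs from the label."""
    t = class_thresholds(labels, probs, n_classes)
    flags = []
    for y, p in zip(labels, probs):
        candidates = [j for j in range(n_classes) if p[j] >= t[j]]
        if not candidates:
            flags.append(False)  # no class is predicted confidently
            continue
        confident = max(candidates, key=lambda j: p[j])
        flags.append(confident != y)
    return flags


def ensemble_issues(labels: Sequence[int],
                    probs_per_learner: Sequence[Sequence[Sequence[float]]],
                    n_classes: int,
                    min_votes: int) -> List[bool]:
    """Assumed combination rule: flag when at least min_votes learners agree."""
    votes = [0] * len(labels)
    for probs in probs_per_learner:
        for i, flag in enumerate(confident_issues(labels, probs, n_classes)):
            votes[i] += int(flag)
    return [v >= min_votes for v in votes]


# Toy data: example 2 carries label 0 but looks like class 1.
labels = [0, 0, 0, 1, 1, 1]
learner_a = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.1, 0.9], [0.2, 0.8], [0.4, 0.6]]
learner_b = [[0.7, 0.3], [0.9, 0.1], [0.3, 0.7], [0.2, 0.8], [0.1, 0.9], [0.7, 0.3]]
learner_c = [[0.8, 0.2], [0.6, 0.4], [0.55, 0.45], [0.3, 0.7], [0.2, 0.8], [0.2, 0.8]]

flags = ensemble_issues(labels, [learner_a, learner_b, learner_c],
                        n_classes=2, min_votes=2)
flagged = [i for i, f in enumerate(flags) if f]
print(flagged)
```

In this toy run, the second learner alone would also flag example 5, but the vote discards that isolated flag. This is the kind of selection improvement an ensemble is meant to provide under the assumed voting rule.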

Published

2025-01-20

How to Cite

Okabayashi Miyaji, R., Valencia de Almeida, F., & Luiz Pizzigatti Corrêa, P. (2025). Enhancing the Performance of Machine Learning Classifiers through Data Cleaning with Ensemble Confident Learning. Journal of Information and Data Management, 16(1), 72–81. https://doi.org/10.5753/jidm.2025.4230

Section

SBBD 2023 Full papers - Extended papers