Comparative Analysis of Undersampling Methods in Transformer-Based Automatic Text Classification

Authors

  • Guilherme Fonseca UFSJ
  • Washington Cunha UFMG
  • Leonardo Rocha UFSJ

DOI:

https://doi.org/10.5753/reic.2024.4643

Abstract

Automatic Text Classification (ATC) on imbalanced datasets is a common challenge in real-world applications. In this scenario, one of the classes is underrepresented, which can bias the learning process. This work investigates the effect of undersampling methods, which aim to reduce the number of majority-class instances, on the performance of recent transformer-based ATC strategies. We evaluate 15 existing undersampling strategies and one new strategy proposed in this work. Our results suggest that undersampling is important for improving the performance of classification methods on imbalanced collections, reducing not only learning bias but also training cost.
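As an illustration of the general idea of undersampling described above, the following is a minimal sketch of plain random undersampling using only the Python standard library. It is not necessarily any of the strategies evaluated in this work, and it is not the proposed method; the function name `random_undersample`, its parameters, and the toy data are assumptions introduced here for illustration.

```python
import random
from collections import Counter, defaultdict


def random_undersample(texts, labels, seed=0):
    """Randomly drop examples from the larger classes until every class
    has as many examples as the smallest one (hypothetical helper)."""
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # The minority class size is the target size for every class.
    target = min(len(indices) for indices in by_class.values())

    # Sample `target` indices per class without replacement.
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))
    kept.sort()  # preserve the original document order

    return [texts[i] for i in kept], [labels[i] for i in kept]


# Toy imbalanced collection: 8 majority ("neg") and 2 minority ("pos") documents.
texts = [f"doc {i}" for i in range(10)]
labels = ["neg"] * 8 + ["pos"] * 2

balanced_texts, balanced_labels = random_undersample(texts, labels)
print(Counter(balanced_labels))
```

On the toy data, both classes end up with two documents each, so a downstream classifier would be trained on 4 documents instead of 10. This is the mechanism by which undersampling can reduce both class bias and training cost.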

Published

2024-06-28

How to Cite

Fonseca, G., Cunha, W., & Rocha, L. (2024). Comparative Analysis of Undersampling Methods in Transformer-Based Automatic Text Classification. Electronic Journal of Undergraduate Research on Computing, 22(1), 1–10. https://doi.org/10.5753/reic.2024.4643

Section

Full Papers