Comparative Analysis of Undersampling Methods in Transformer-Based Automatic Text Classification
DOI:
https://doi.org/10.5753/reic.2024.4643
Abstract
Automatic Text Classification (ATC) on imbalanced datasets is a common challenge in real-world applications. In this scenario, one of the classes is underrepresented, which can bias the learning process. This work investigates the effect of undersampling methods, which aim to reduce the number of majority-class instances, on the performance of recent transformer-based ATC strategies. We evaluate 15 existing undersampling strategies and one proposed in this work. Our results suggest that undersampling approaches are important for improving the performance of classification methods on imbalanced collections, not only by reducing learning bias but also by reducing training cost.
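As a minimal sketch of what "reducing majority-class instances" means in practice, the snippet below applies plain random undersampling to a toy labeled collection. Random undersampling is used here only as the simplest illustration of the idea; it is not claimed to be one of the strategies evaluated or proposed in the paper, and the function name `random_undersample` and the toy data are assumptions made for this example.

```python
import random
from collections import Counter


def random_undersample(texts, labels, seed=0):
    """Randomly drop instances until every class has as many
    examples as the smallest (minority) class.

    Illustrative helper only, not the paper's method.
    """
    rng = random.Random(seed)

    # Group instance indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Every class is reduced to the size of the smallest class.
    target = min(len(idxs) for idxs in by_class.values())

    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, target))
    kept.sort()  # preserve the original document order

    return [texts[i] for i in kept], [labels[i] for i in kept]


# Toy imbalanced collection: 8 documents of class 0, 2 of class 1.
texts = [f"doc {i}" for i in range(10)]
labels = [0] * 8 + [1] * 2

bal_texts, bal_labels = random_undersample(texts, labels)
print(Counter(bal_labels))
```

The balanced subset would then replace the full training set when fine-tuning a classifier, which is also where a reduction in training cost would come from: fewer training instances per epoch.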
License
Copyright (c) 2024 The authors
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.