Comparison of Clustering Techniques in Text Documents in Portuguese
DOI: https://doi.org/10.5753/isys.2025.5029

Keywords: Text Clustering, BERT, BERTimbau, K-Means, Single Linkage, Gaussian Mixture Model

Abstract
Managing the vast amount of text data in the digital world is a complex challenge, and clustering text documents is an effective way to tackle it. This study evaluated three clustering algorithms, K-Means, Single Linkage, and Gaussian Mixture Model (GMM), on Brazilian Portuguese news articles, using BERTimbau, a Portuguese variant of the BERT model, in the preprocessing step. Performance was measured with accuracy, F1-score, the Rand index, and the Jaccard coefficient. Across these metrics, Single Linkage achieved the best overall performance, surpassing K-Means and GMM in most of the evaluated criteria.
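The pipeline described above can be sketched with scikit-learn. This is a minimal illustration, not the authors' code: random blob vectors stand in for the 768-dimensional BERTimbau document embeddings (producing real ones would require the `transformers` library and a Portuguese BERT checkpoint), and only the Rand index is computed here as a representative metric.

```python
# Sketch of the evaluation setup: three clustering algorithms applied to
# document vectors, scored against known topic labels with the Rand index.
# Synthetic blobs play the role of BERTimbau embeddings.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import rand_score

# 300 "documents", 768-dim vectors, 3 underlying news topics
X, y_true = make_blobs(n_samples=300, n_features=768, centers=3,
                       cluster_std=1.0, random_state=42)

predictions = {
    "K-Means": KMeans(n_clusters=3, n_init=10,
                      random_state=0).fit_predict(X),
    "Single Linkage": AgglomerativeClustering(n_clusters=3,
                                              linkage="single").fit_predict(X),
    "GMM": GaussianMixture(n_components=3,
                           random_state=0).fit_predict(X),
}

scores = {name: rand_score(y_true, labels)
          for name, labels in predictions.items()}
for name, score in scores.items():
    print(f"{name}: Rand index = {score:.3f}")
```

On real embeddings the ranking can differ from this toy setting; the study's finding is that Single Linkage led on most metrics for the BERTimbau-encoded news corpus.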
License
Copyright (c) 2025 iSys - Brazilian Journal of Information Systems

This work is licensed under a Creative Commons Attribution 4.0 International License.

