Enhancing Contributions to Brazilian Social Media Analysis Based on Topic Modeling with Native BERT Models
DOI: https://doi.org/10.5753/jidm.2025.4670

Keywords: Computational Social Sciences, Digital Humanities, Natural Language Processing, Social Media Analysis, Topic Modeling

Abstract
This study introduces a computational approach utilizing natural language processing for text analysis, focusing in particular on topic modeling over large-scale textual data. Given the increasing volume of information shared on social media platforms such as X (Twitter), there is a pressing need for effective methods to extract and understand the underlying topics in these texts. Continuing our previous work with LDA, BTM, NMF, and BERTopic, we conducted experiments using BERT embedding models tailored for Brazilian Portuguese, namely BERTimbau and BERTweet.BR, alongside the standard multilingual BERTopic model. We also performed experiments with LLM-based embedding models, NV-Embed-v2 and gte-Qwen2-7B-instruct, within the BERTopic structure. Our findings reveal that gte-Qwen2-7B-instruct outperforms the others in topic coherence, followed by NV-Embed-v2, BERTimbau Large, BERTimbau Base, BERTweet.BR, and the standard multilingual BERTopic. For the BERT models, this demonstrates the superior capability of models trained specifically on Brazilian Portuguese data in capturing the nuances of the language. For the LLM-based models, the multilingual capability (including Portuguese) of gte-Qwen2-7B-instruct helps explain its advantage over NV-Embed-v2. The enhanced performance of gte-Qwen2-7B-instruct also highlights the role of larger model sizes in achieving higher accuracy and coherence in topic modeling tasks. These results contribute valuable insights for future research on social, political, and economic issues through social media data.
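The BERTopic pipeline mentioned above derives topic words from document clusters using a class-based TF-IDF (c-TF-IDF; Grootendorst, 2022). As a minimal illustration — not the authors' code, and with toy counts invented for the example — the following numpy sketch applies that weighting to term counts already aggregated per topic:

```python
import numpy as np

def c_tf_idf(term_counts):
    """Class-based TF-IDF as described by Grootendorst (2022).

    term_counts: (n_classes, n_terms) matrix of raw term frequencies,
    where each row aggregates all documents assigned to one topic.
    """
    # term frequency normalized within each class (topic)
    tf = term_counts / term_counts.sum(axis=1, keepdims=True)
    # A: average number of words per class
    avg_words = term_counts.sum() / term_counts.shape[0]
    # f_t: frequency of each term across all classes
    f_t = term_counts.sum(axis=0)
    idf = np.log(1 + avg_words / f_t)
    return tf * idf

# toy example: two topic classes, three vocabulary terms
counts = np.array([[8.0, 1.0, 1.0],
                   [1.0, 1.0, 8.0]])
weights = c_tf_idf(counts)
# the term dominating each class receives the highest weight in that row
```

Terms shared evenly across classes (the middle column here) are down-weighted by the IDF factor, so each topic's representation is driven by its distinctive vocabulary — the property the coherence comparisons in the abstract ultimately evaluate.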
References
Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B., and Kochut, K. (2017). A brief survey of text mining: Classification, clustering and extraction techniques.
Angelov, D. (2020). Top2vec: Distributed representations of topics. arXiv preprint arXiv:2008.09470.
Bird, S., Klein, E., and Loper, E. (2009). Natural language processing with Python. O’Reilly Media.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of machine Learning research, 3(Jan):993–1022.
Boon-Itt, S. and Skunkan, Y. (2020). Public perception of the covid-19 pandemic on Twitter: Sentiment analysis and topic modeling study. JMIR Public Health Surveill, 6(4):e21978. DOI: 10.2196/21978.
Churchill, R. and Singh, L. (2022). The evolution of topic modeling. ACM Comput. Surv., 54(10s). DOI: 10.1145/3507900.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391–407. DOI: https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.
Egger, R. and Yu, J. (2022). A topic modeling comparison between LDA, NMF, Top2Vec, and BERTopic to demystify Twitter posts. Frontiers in Sociology, 7. DOI: 10.3389/fsoc.2022.886498.
Gallagher, R. J., Reing, K., Kale, D., and Ver Steeg, G. (2017). Anchored correlation explanation: Topic modeling with minimal domain knowledge. Transactions of the Association for Computational Linguistics, 5:529–542.
Gomes, G. B. and Attux, R. (2023). Contributions to social media analysis based on topic modelling. In Anais do XI Symposium on Knowledge Discovery, Mining and Learning, pages 113–120, Porto Alegre, RS, Brasil. SBC. DOI: 10.5753/kdmile.2023.231795.
Goswami, A., Kumar, A., and Pramod, D. (2024). Bursty event detection model for Twitter. In Devismes, S., Mandal, P. S., Saradhi, V. V., Prasad, B., Molla, A. R., and Sharma, G., editors, Distributed Computing and Intelligent Technology, pages 338–355, Cham. Springer Nature Switzerland.
Grootendorst, M. (2022a). Bertopic. [link].
Grootendorst, M. (2022b). Bertopic: Neural topic modeling with a class-based tf-idf procedure. arXiv preprint arXiv:2203.05794.
Guimarães, S. A. d. S., Rocha, E. S. S., and Mugnaini, R. (2023). Estudo cientométrico da atividade acadêmica sobre as temáticas de humanidades digitais e big data nas universidades estaduais paulistas. Encontros Bibli: revista eletrônica de biblioteconomia e ciência da informação, 28:1–34. DOI: 10.5007/1518-2924.2023.e90566.
Halbwachs, M. (1950). La mémoire collective [The Collective Memory]. Paris, France: Presses Universitaires de France.
Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’99, page 50–57, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/312624.312649.
Honnibal, M., Montani, I., Van Landeghem, S., and Boyd, A. (2020). spaCy: Industrial-strength Natural Language Processing in Python. DOI: 10.5281/zenodo.1212303.
Invernici, F., Curati, F., Jakimov, J., Samavi, A., and Bernasconi, A. (2024). Capturing research literature attitude towards sustainable development goals: an LLM-based topic modeling approach.
Jónsson, E. (2016). An evaluation of topic modelling techniques for Twitter.
Karami, A., Bennett, L. S., and He, X. (2018). Mining public opinion about economic issues. Int. J. Strat. Decis. Sci., 9(1):18–28.
KH, M., Zainuddin, H., and Wabula, Y. (2022). Twitter social media conversion topic trending analysis using Latent Dirichlet Allocation algorithm. Journal of Applied Engineering and Technological Science (JAETS), 4(1):390–399. DOI: 10.37385/jaets.v4i1.1143.
Krippendorff, K. (2018). Content analysis. SAGE Publications, Thousand Oaks, CA, 4 edition.
Lazer, D. M. J., Pentland, A., Watts, D. J., Aral, S., Athey, S., Contractor, N., Freelon, D., Gonzalez-Bailon, S., King, G., Margetts, H., Nelson, A., Salganik, M. J., Strohmaier, M., Vespignani, A., and Wagner, C. (2020). Computational social science: Obstacles and opportunities. Science, 369(6507):1060–1062. DOI: 10.1126/science.aaz8170.
Lee, C., Roy, R., Xu, M., Raiman, J., Shoeybi, M., Catanzaro, B., and Ping, W. (2024). NV-Embed: Improved techniques for training llms as generalist embedding models. arXiv preprint arXiv:2405.17428.
Li, Z., Zhang, X., Zhang, Y., Long, D., Xie, P., and Zhang, M. (2023). Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). RoBERTa: A robustly optimized bert pretraining approach.
Lotto, M., Zakir Hussain, I., Kaur, J., Butt, Z. A., Cruvinel, T., and Morita, P. P. (2023). Analysis of fluoride-free content on Twitter: Topic modeling study. J Med Internet Res, 25:e44586. DOI: 10.2196/44586.
Lyu, J. C. and Luli, G. K. (2021). Understanding the public discussion about the centers for disease control and prevention during the covid-19 pandemic using Twitter data: Text mining analysis study. J Med Internet Res, 23(2):e25108. DOI: 10.2196/25108.
Machado, M. G. and Colevati, J. (2021). Anticomunismo e Gramscismo Cultural no Brasil. Revista Aurora, 14(Edição Especial):23–34. Number: Edição Especial. DOI: 10.36311/1982-8004.2021.v14esp.p23-34.
Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. (2022). MTEB: Massive Text Embedding Benchmark. arXiv preprint arXiv:2210.07316. DOI: 10.48550/ARXIV.2210.07316.
Murtagh, F. and Legendre, P. (2014). Ward’s hierarchical agglomerative clustering method: which algorithms implement ward’s criterion? Journal of classification, 31:274–295.
Nguyen, D. Q., Vu, T., and Tuan Nguyen, A. (2020). BERTweet: A pre-trained language model for English tweets. In Liu, Q. and Schlangen, D., editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 9–14, Online. Association for Computational Linguistics. DOI: 10.18653/v1/2020.emnlp-demos.2.
Nisha and Kumar R, D. A. (2019). Implementation on text classification using bag of words model. SSRN Electron. J.
Paatero, P. and Tapper, U. (1994). Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2):111–126. DOI: https://doi.org/10.1002/env.3170050203.
Panichella, A. (2021). A systematic comparison of search-based approaches for LDA hyperparameter tuning. Information and Software Technology, 130:106411. DOI: https://doi.org/10.1016/j.infsof.2020.106411.
Pham, C. M., Hoyle, A., Sun, S., Resnik, P., and Iyyer, M. (2024). TopicGPT: A prompt-based topic modeling framework.
Ramamoorthy, T., Kulothungan, V., and Mappillairaju, B. (2024). Topic modeling and social network analysis approach to explore diabetes discourse on twitter in india. Frontiers in Artificial Intelligence, 7. DOI: 10.3389/frai.2024.1329185.
Rao, V. K., Valdez, D., Muralidharan, R., Agley, J., Eddens, K. S., Dendukuri, A., Panth, V., and Parker, M. A. (2024). Digital epidemiology of prescription drug references on x (formerly twitter): Neural network topic modeling and sentiment analysis. J Med Internet Res, 26:e57885. DOI: 10.2196/57885.
Rehůřek, R. and Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. DOI: 10.13140/2.1.2393.1847.
Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
Roberts, M. E., Stewart, B. M., Tingley, D., Airoldi, E. M., et al. (2013). The structural topic model and applied social science. In Advances in neural information processing systems workshop on topic models: computation, application, and evaluation, volume 4, pages 1–20. Harrahs and Harveys, Lake Tahoe.
Robila, M. and Robila, S. A. (2020). Applications of artificial intelligence methodologies to behavioral and social sciences. Journal of Child and Family Studies, 29(10):2954–2966.
Röder, M., Both, A., and Hinneburg, A. (2015). Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM ’15, page 399–408, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/2684822.2685324.
Shadrova, A. (2021). Topic models do not model topics: epistemological remarks and steps towards best practices. Journal of Data Mining & Digital Humanities, 2021. DOI: 10.46298/jdmdh.7595.
Shyu, R. and Weng, C. (2024). Enabling semantic topic modeling on Twitter using MetaMap. AMIA Summits Transl. Sci. Proc., 2024:670–678.
Souza, F., Nogueira, R., and Lotufo, R. (2020). BERTimbau: pretrained BERT models for Brazilian Portuguese. In 9th Brazilian Conference on Intelligent Systems, BRACIS, Rio Grande do Sul, Brazil, October 20-23 (to appear).
Sridhar, V. K. R. (2015). Unsupervised topic modeling for short texts using distributed representations of words. In Proceedings of the 1st workshop on vector space modeling for natural language processing, pages 192–200.
Sumikawa, Y., Jatowt, A., and Düring, M. (2018). Digital history meets microblogging: Analyzing collective memories in Twitter. In Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, JCDL ’18, page 213–222, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/3197026.3197057.
Teh, Y., Jordan, M., Beal, M., and Blei, D. (2004). Sharing clusters among related groups: Hierarchical Dirichlet processes. Advances in neural information processing systems, 17.
Urhan, C. (2024). Enhancing Semantic Understanding by Bridging Topic Modeling and Thematic Analysis: An Empirical Study on Self-Help Twitter Corpus and In-Depth Interviews, pages 53–71. Springer Nature Switzerland, Cham. DOI: 10.1007/978-3-031-48941-9_5.
Uthirapathy, S. E. and Sandanam, D. (2023). Topic modelling and opinion analysis on climate change Twitter data using LDA and BERT model. Procedia Computer Science, 218:908–917. International Conference on Machine Learning and Data Engineering. DOI: https://doi.org/10.1016/j.procs.2023.01.071.
Wagner, J., Wilkens, R., Idiart, M., and Villavicencio, A. (2018). The brWaC corpus: A new open resource for Brazilian Portuguese. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Wang, Y.-X. and Zhang, Y.-J. (2013). Nonnegative Matrix Factorization: A comprehensive review. IEEE Transactions on Knowledge and Data Engineering, 25(6):1336–1353. DOI: 10.1109/TKDE.2012.51.
Xu, W. W., Tshimula, J. M., Dubé, È., Graham, J. E., Greyson, D., MacDonald, N. E., and Meyer, S. B. (2022). Unmasking the Twitter discourses on masks during the covid-19 pandemic: User cluster–based BERT topic modeling approach. JMIR Infodemiology, 2(2):e41198. DOI: 10.2196/41198.
Xue, J., Chen, J., Hu, R., Chen, C., Zheng, C., Su, Y., and Zhu, T. (2020). Twitter discussions and emotions about the covid-19 pandemic: Machine learning approach. J Med Internet Res, 22(11):e20550. DOI: 10.2196/20550.
Yan, X., Guo, J., Lan, Y., and Cheng, X. (2013). A Biterm Topic Model for short texts. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13, page 1445–1456, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/2488388.2488514.

