Sentiment Analysis of Shared Content in Brazilian Reddit Communities

Giovana Piorino; Vitor Moreira; Luiz Henrique Quevedo Lima; Ana Clara Souza Pagano; Adriana Silvina Pagano; Ana Paula Couto da Silva

doi:10.5753/jis.2025.5564

Authors

Giovana Piorino, Federal University of Minas Gerais, https://orcid.org/0009-0005-6049-860X
Vitor Moreira, Federal University of Minas Gerais, https://orcid.org/0009-0001-0694-5414
Luiz Henrique Quevedo Lima, Federal University of Minas Gerais, https://orcid.org/0009-0000-3440-8037
Ana Clara Souza Pagano, Federal University of Minas Gerais, https://orcid.org/0000-0001-7685-9928
Adriana Silvina Pagano, Federal University of Minas Gerais, https://orcid.org/0000-0002-3150-3503
Ana Paula Couto da Silva, Federal University of Minas Gerais, https://orcid.org/0000-0001-5951-3562


Keywords:

Sentiment Analysis, Classification Models, Human Labeling, Linguistic Analysis

Abstract

The growth of social media over the present decade has been one of the main drivers of research on user-generated content. Reddit, a social network that has been gaining popularity among Brazilians, has become a source of data for sentiment analysis studies that evaluate automated models for this task. This article reports on the development and evaluation of a dataset of human-annotated Reddit comments and on its comparison with the output of sentiment classification models. Comments retrieved from Brazilian Reddit communities were labeled by human annotators and then classified automatically by 10 models with different architectures. Human labeling yielded moderate agreement coefficients and a reasonable degree of disagreement, highlighting the subjectivity of the task. Models based on LLMs and BERT performed well on Brazilian Portuguese texts. The comparison showed that humans and models face similar difficulties in assigning sentiment, and that the language characteristics of the texts are a major challenge for model classification, which points to the need for further advances in automated language understanding.


