Pipelining Semantic Expansion and Noise Filtering for Sentiment Analysis of Short Documents – CluSent Method

Felipe Viegas; Sergio Canuto; Washington Cunha; Celso França; Claudio Valiense; Guilherme Fonseca; Ana Machado; Leonardo Rocha; Marcos André Gonçalves

doi:10.5753/jis.2024.4117

Authors

Felipe Viegas Universidade Federal de Minas Gerais https://orcid.org/0000-0001-8121-8607
Sergio Canuto Instituto Federal de Goiás https://orcid.org/0000-0003-2973-4158
Washington Cunha Universidade Federal de Minas Gerais https://orcid.org/0000-0002-1988-8412
Celso França Universidade Federal de Minas Gerais https://orcid.org/0000-0002-0251-7172
Claudio Valiense Universidade Federal de Minas Gerais https://orcid.org/0000-0002-7366-2633
Guilherme Fonseca Universidade Federal de São João del-Rei https://orcid.org/0009-0000-7862-8701
Ana Machado Universidade Federal de São João del-Rei https://orcid.org/0009-0000-2930-8795
Leonardo Rocha Universidade Federal de São João del-Rei https://orcid.org/0000-0002-4913-4902
Marcos André Gonçalves Universidade Federal de Minas Gerais https://orcid.org/0000-0002-2075-3363

Keywords:

Sentiment Analysis, Classification, Natural Language Processing

Abstract

Building effective sentiment models is especially hard for short texts, which carry little information. Enriching short texts with semantic relationships is therefore crucial for capturing affective nuances and improving model effectiveness, although it can also introduce noise. This article introduces CluSent, a novel approach to customized, dataset-oriented sentiment analysis. CluSent builds on CluWords, a powerful representation of semantically related words. To tackle information scarcity and noise, CluSent (i) leverages the semantic neighborhood of pre-trained word embedding representations to enrich the document representation and (ii) introduces dataset-specific filtering and weighting mechanisms to manage noise. These mechanisms use part-of-speech and polarity/intensity information from lexicons. In an extensive experimental evaluation spanning 19 datasets and five state-of-the-art baselines, including modern transformer architectures, CluSent was the best method in most scenarios (28 out of 38 possibilities), with performance gains of up to 14% over the strongest baselines.
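The two steps named in the abstract, semantic expansion and noise filtering, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, both thresholds, the toy embeddings, and the polarity scores are assumptions made for illustration, and the part-of-speech filter and weighting scheme mentioned in the abstract are left out.

```python
import numpy as np

def expand_with_filtered_neighbors(tf, embeddings, polarity,
                                   sim_threshold=0.7, min_polarity=0.1):
    """Enrich term-frequency rows with similar words that carry sentiment.

    tf: (n_docs, n_words); embeddings: (n_words, dim); polarity: (n_words,).
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    # Cosine similarity between every pair of vocabulary words.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T
    # Semantic expansion: keep only sufficiently similar neighbors.
    sim = np.where(sim >= sim_threshold, sim, 0.0)
    # Noise filtering (assumed rule): a neighbor is kept only if the lexicon
    # gives it a non-negligible polarity; a word always keeps itself.
    has_sentiment = np.abs(polarity) >= min_polarity
    mask = has_sentiment[np.newaxis, :] | np.eye(len(polarity), dtype=bool)
    return tf @ (sim * mask)

# Toy vocabulary with hypothetical 2-D embeddings and lexicon polarities.
vocab = ["good", "great", "thing", "bad"]
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.6], [-1.0, 0.0]])
polarity = np.array([0.6, 0.8, 0.0, -0.7])
tf = np.array([[1.0, 0.0, 0.0, 0.0],   # document containing only "good"
               [0.0, 0.0, 0.0, 1.0]])  # document containing only "bad"
expanded = expand_with_filtered_neighbors(tf, embeddings, polarity)
```

In this toy input, the document containing "good" gains weight on "great", while the neutral neighbor "thing" and the dissimilar word "bad" stay at zero.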

