Extracting Features from Text Flows based on Semantic Similarity for Text Classification: an Approach Inspired by Audio Analysis

Larissa Lucena Vasconcelos; Claudio E. C. Campelo

doi:10.5753/jbcs.2024.3759

Authors

Larissa Lucena Vasconcelos, Federal Institute of Paraíba, https://orcid.org/0000-0003-4942-7649
Claudio E. C. Campelo, Federal University of Campina Grande, https://orcid.org/0000-0003-4404-2344

Keywords:

NLP, Semantic Similarity, Text Classification, TextFlow, Lexicon-based Representation

Abstract

Text classification is a widely investigated challenge in Natural Language Processing (NLP) research. The performance of a classification model depends on a representation that can extract valuable information from the texts. To avoid losing crucial local text information, texts can be represented as flows, that is, sequences of information collected from the texts. This paper proposes an approach that combines several techniques to represent texts: representation by flows, word embeddings associated with lexicon information via semantic similarity distances, and the extraction of features inspired by well-established audio analysis features.
To perform text classification, the approach splits the text into sentences and calculates a semantic similarity metric between each sentence and a lexicon in an embedding vector space. The sequence of semantic similarity metrics composes the text flow. The method then extracts twenty-five features inspired by audio analysis (named Audio-Like Features). The adaptation of features from audio analysis is motivated by the similarity between a text flow and a digital signal, in addition to the existing relationship between text, speech, and audio. We evaluated the method on three NLP classification tasks: Fake News Detection in English, Fake News Detection in Portuguese, and Newspaper Columns versus News Classification. The efficacy of the approach is compared to baselines that embed semantics in the text representation: Paragraph Vector and BERT. The objective of the experiments was to investigate whether the proposed approach could compete with the baseline methods or improve their efficacy when combined with them. The experimental evaluation demonstrates that combining the proposed method with the baseline methods can enhance the baseline classification efficacy in all three scenarios. In the Fake News Detection in Portuguese task, our approach surpassed the baselines and obtained the best effectiveness (PR-AUC = 0.98).
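The pipeline described above (sentences, similarity to a lexicon, text flow, signal-style features) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the two-dimensional `TOY_EMBEDDINGS` table, the averaging of word vectors, the use of cosine similarity, and the two features shown (mean energy and zero-crossing rate, both common in audio analysis) are assumptions made for the example. The abstract does not list the twenty-five Audio-Like Features or specify the similarity metric.

```python
import math
import re

# Hypothetical 2-d toy embeddings (assumption); a real system would load
# pretrained word vectors instead.
TOY_EMBEDDINGS = {
    "shocking": [0.9, 0.1],
    "unbelievable": [0.8, 0.2],
    "scandal": [0.85, 0.15],
    "report": [0.1, 0.9],
    "official": [0.2, 0.8],
    "said": [0.3, 0.7],
}


def mean_vector(words):
    """Average the embeddings of the known words; None if no word is known."""
    vecs = [TOY_EMBEDDINGS[w] for w in words if w in TOY_EMBEDDINGS]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def text_flow(text, lexicon):
    """Return one similarity value per sentence: the text flow."""
    lex_vec = mean_vector(lexicon)
    if lex_vec is None:
        raise ValueError("no lexicon word has an embedding")
    flow = []
    # Naive sentence split on ., ! or ? followed by whitespace (assumption).
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[a-z]+", sentence.lower())
        vec = mean_vector(words)
        flow.append(cosine(vec, lex_vec) if vec is not None else 0.0)
    return flow


def energy(flow):
    """Mean squared value of the flow, analogous to signal energy."""
    return sum(x * x for x in flow) / len(flow)


def zero_crossing_rate(flow):
    """Fraction of consecutive pairs whose mean-centered values change sign."""
    if len(flow) < 2:
        return 0.0
    mean = sum(flow) / len(flow)
    centered = [x - mean for x in flow]
    crossings = sum(1 for a, b in zip(centered, centered[1:]) if a * b < 0)
    return crossings / (len(flow) - 1)


if __name__ == "__main__":
    sample = "Shocking scandal! The official report said otherwise. Unbelievable."
    flow = text_flow(sample, ["shocking", "unbelievable"])
    print(flow, energy(flow), zero_crossing_rate(flow))
```

In this sketch the flow is treated exactly like a short one-dimensional signal, so any feature defined over a sequence of samples could be added in the same style. The resulting feature vector could then be passed to a classifier on its own or concatenated with another representation, which is the kind of combination the abstract evaluates.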
