Semantic Coherence of Short Text at the Word Level
DOI: https://doi.org/10.5753/jbcs.2025.3705

Keywords: Semantic Coherence, Short Text, Word Semantics, Language Models, Contextualized Embeddings

Abstract
Most text coherence models proposed in the literature focus on sentence ordering and on the semantic similarity of neighboring sentences. Consequently, they cannot be applied to documents with a single sentence and do not properly capture incoherences caused by particular words. This work, in contrast, focuses on word-level coherence in short texts. It proposes COHEWL (COHErence at Word Level), a framework for assessing short document coherence at the level of word semantics. COHEWL also supports contrastive data generation by replacing particular words with others that may fit the context of the respective documents. We conducted experiments with single-sentence questions, typical of QA, in Brazilian Portuguese and English. BERT, properly trained for a new task proposed in this paper – discriminating original documents from those with a changed word – achieves accuracy between 80% and 99.88%. However, our experimental results did not show relevant correlations between the rank of a word in the BERT Masked Language Model (MLM) predictions and coherence (or incoherence) measures calculated as average similarities (or distances) between BERT embeddings of texts changed by the predicted words. In addition, on our manually created corpus of coherent and incoherent questions about data structures, coherence measures based on a topic model built from a few documents of the same domain discriminate coherent documents from incoherent ones with much higher precision than the coherence measures derived from BERT embeddings.
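To make the two BERT-based quantities mentioned in the abstract concrete, the sketch below illustrates, under our own assumptions, (i) the rank of a word in BERT's MLM prediction for its masked position and (ii) a simple word-level coherence score computed as the average cosine similarity between that word's contextualized embedding and the embeddings of the remaining tokens. This is not the authors' COHEWL implementation; the model name (bert-base-uncased), the helper function mlm_rank_and_coherence, and the example sentence are illustrative placeholders.

```python
# Illustrative sketch only (not the COHEWL code from the paper).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_NAME = "bert-base-uncased"  # assumption; any BERT-like MLM could be used
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def mlm_rank_and_coherence(sentence: str, target_word: str):
    tokens = tokenizer(sentence, return_tensors="pt")
    ids = tokens["input_ids"][0]
    target_id = tokenizer.convert_tokens_to_ids(target_word)
    positions = (ids == target_id).nonzero(as_tuple=True)[0]
    if len(positions) == 0:
        raise ValueError("target word must appear as a single vocabulary token")
    pos = positions[0].item()

    # (i) MLM rank: mask the target position and rank the original word
    # among all vocabulary items by the predicted logits (1 = best ranked).
    masked = ids.clone()
    masked[pos] = tokenizer.mask_token_id
    with torch.no_grad():
        out = model(input_ids=masked.unsqueeze(0),
                    attention_mask=tokens["attention_mask"])
    logits = out.logits[0, pos]
    rank = int((logits > logits[target_id]).sum().item()) + 1

    # (ii) Embedding-based coherence: average cosine similarity between the
    # target word's last-layer embedding and those of the other tokens
    # (for simplicity, [CLS] and [SEP] are kept among the "other" tokens).
    with torch.no_grad():
        hidden = model(**tokens).hidden_states[-1][0]  # (seq_len, hidden_dim)
    target_vec = hidden[pos]
    others = torch.cat([hidden[:pos], hidden[pos + 1:]])
    sims = torch.nn.functional.cosine_similarity(
        others, target_vec.unsqueeze(0), dim=1)
    return rank, sims.mean().item()

# Example usage on a single-sentence question-like text about data structures.
print(mlm_rank_and_coherence("a stack is a last in first out data structure", "stack"))
```

Replacing the target word with one that does not fit the context should, intuitively, worsen the MLM rank and lower the average similarity; the paper's experiments examine how reliably such measures actually separate coherent from incoherent short texts.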
License
Copyright (c) 2025 Osmar de Oliveira Braz Junior, Renato Fileto

This work is licensed under a Creative Commons Attribution 4.0 International License.

