Identification and classification of speech disfluencies: A systematic review on methods, databases, tools, evaluation and challenges

DOI:

https://doi.org/10.5753/jbcs.2025.4443

Keywords:

Disfluencies, Speech Recognition, Spoken Dialogue, Natural Language Processing, Rich Transcription

Abstract

With the advancement of multimedia technologies, human-computer conversational interfaces are becoming increasingly important and are emerging as a highly promising area of research. Vocal representations, facial expressions, and body language can be used to extract various types of information. In the context of vocal representations, the complexity of human communication involves a wide range of expressions that vary according to grammatical rules, languages, accents, slang, disfluencies, and other speech events. In particular, the detection of disfluencies, i.e., interruptions in the normal flow of speech characterized by pauses, repetitions, and sound prolongations, is of interest not only for improving speech recognition systems but also for potentially identifying emotional aspects in audio. Several studies have aimed to define computational methods to identify and classify disfluencies, as well as appropriate evaluation methods in different languages. However, no study has compiled the findings in the literature on this topic. Such a compilation is important both for summarizing the motivations and applications of the research and for identifying opportunities that could guide new investigations. Our objective is to provide an analysis of the state of the art, the main limitations, and the challenges in this field. Eighty articles were extracted from four databases and analyzed through a systematic review. Our results show that research into the detection of disfluencies has been conducted for various purposes. Some studies aimed to improve the performance of translation tools, while others focused on the summarization of spoken dialogues, speaker diarization, and Natural Language Processing. Most of the research was oriented toward the English language. F-score, precision, and recall were the most commonly used evaluation measures for the reported methods.
Statistical and machine learning techniques were widely applied, with CRFs (Conditional Random Fields), MaxEnt (Maximum Entropy), Decision Trees, and BLSTM (Bidirectional Long Short-Term Memory) being especially prominent. In general, newer approaches, such as BERT and BLSTM, have demonstrated higher performance. However, several challenges remain, opening up new research opportunities.
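
As an illustration of the evaluation measures named above, the following minimal sketch computes token-level precision, recall, and F-score for a binary disfluency tagger. It is not taken from any of the reviewed studies: the function name `precision_recall_f1`, the 0/1 labeling scheme (1 = disfluent token), and the example utterance are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    gold: Sequence[int], predicted: Sequence[int]
) -> Tuple[float, float, float]:
    """Token-level precision, recall, and F1 for the disfluent class (label 1)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == 1 and p == 1)
    false_pos = sum(1 for g, p in pairs if g == 0 and p == 1)
    false_neg = sum(1 for g, p in pairs if g == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or present.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Hypothetical utterance; "to boston uh" is marked disfluent in the gold labels.
tokens = ["i", "want", "a", "flight", "to", "boston", "uh", "to", "denver"]
gold = [0, 0, 0, 0, 1, 1, 1, 0, 0]
# Hypothetical system output: misses the first "to", wrongly flags the second.
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 0]

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this example the tagger finds two of the three disfluent tokens and raises one false alarm, so precision and recall are both 2/3. Published results may score spans rather than tokens or restrict scoring to particular disfluency types, so figures from different studies are not necessarily comparable.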


Zayats, V., Ostendorf, M., and Hajishirzi, H. (2014). Multi-domain disfluency and repair detection. In Proc. Interspeech 2014, pages 2907-2911. DOI: 10.21437/Interspeech.2014-603.

Zayats, V., Ostendorf, M., and Hajishirzi, H. (2016). Disfluency detection using a bidirectional LSTM. In Proc. Interspeech 2016, pages 2523-2527. DOI: 10.21437/Interspeech.2016-1247.

Zechner, K. (2001). Automatic generation of concise summaries of spoken dialogues in unrestricted domains. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '01, pages 199-207, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/383952.383989.

Zechner, K. (2002). Automatic summarization of open-domain multiparty dialogues in diverse genres. Computational Linguistics, 28(4):447-485. DOI: 10.1162/089120102762671945.

Zwarts, S. and Johnson, M. (2011). The impact of language models and loss functions on repair disfluency detection. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 703-711, Portland, Oregon, USA. Association for Computational Linguistics. DOI: 10.18653/v1/p17-2087.

Zwarts, S., Johnson, M., and Dale, R. (2010). Detecting speech repairs incrementally using a noisy channel approach. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10, pages 1371-1378, USA. Association for Computational Linguistics. DOI: 10.3115/1218955.1218960.

Published

2025-02-24

How to Cite

Luna, A. S., Machado-Lima, A., & Nunes, F. L. S. (2025). Identification and classification of speech disfluencies: A systematic review on methods, databases, tools, evaluation and challenges. Journal of the Brazilian Computer Society, 31(1), 154–173. https://doi.org/10.5753/jbcs.2025.4443

Section

Articles