Enhancing Automatic Speech Recognition Medical Transcriptions

Yanna Torres Gonçalves; João Victor B Alves; Breno Alef Dourado Sá; José A Fernandes de Macedo; Ticiana L Coelho da Silva

doi:10.5753/jidm.2026.5737

Authors

Yanna Torres Gonçalves Universidade Federal do Ceará https://orcid.org/0009-0003-9110-4963
João Victor B Alves Universidade Federal do Ceará https://orcid.org/0009-0006-5734-4826
Breno Alef Dourado Sá Universidade Federal do Ceará https://orcid.org/0009-0009-0311-9641
José A Fernandes de Macedo Universidade Federal do Ceará https://orcid.org/0000-0002-0661-2978
Ticiana L Coelho da Silva Universidade Federal do Ceará https://orcid.org/0000-0001-7686-9827

DOI:

https://doi.org/10.5753/jidm.2026.5737

Keywords:

Medical History, Automatic Speech Recognition, Language Model, Text Style Transfer

Abstract

Automatic Speech Recognition (ASR) systems can reduce cognitive load and improve efficiency in medical documentation. This study evaluates Whisper and Wav2Vec2 PT for transcribing medical histories in Brazilian Portuguese. Using real audio-text pairs recorded by specialists and nonspecialists, we assess model performance across speaker profiles. We explore decoding with n-gram language models and post-processing with a BERT-based classifier to correct common spelling errors. Additionally, we apply large language models (LLMs) for text style transfer (TST), using prompt-based methods to convert transcriptions into structured medical anamneses. Results show that Whisper outperforms Wav2Vec2 PT overall. The BERT-based correction model improves transcription accuracy, especially when applied after normalization. Among the LLMs tested, Mistral produced the most consistent and structured outputs. These findings demonstrate the potential of combining ASR with language model enhancements for medical documentation, while also highlighting ongoing challenges in clinical ASR.
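The abstract does not name the metric behind "transcription accuracy". As an illustration only, the sketch below assumes word error rate (WER), a common ASR metric, computed after a simple text normalization step. The function names `normalize`, `edit_distance`, and `wer`, the normalization rules, and the Portuguese example sentences are all hypothetical and are not taken from the paper.

```python
import string
import unicodedata
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop ASCII punctuation, and split on whitespace.

    Assumed normalization, not the paper's. Accents are kept because
    they are meaningful in Portuguese.
    """
    text = unicodedata.normalize("NFC", text).lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One missing word out of five reference words.
    print(wer("Paciente relata dor no peito.", "paciente relata dor peito"))
```

Because casing and punctuation are removed before scoring, differences that do not change the words themselves are not counted as errors. This is one plausible reading of why a correction step might score better when applied after normalization.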


