Enhancing Automatic Speech Recognition Medical Transcriptions

Authors

Y. T. Gonçalves, J. V. B. Alves, B. A. D. Sá, J. A. F. de Macedo, T. L. C. da Silva

DOI:

https://doi.org/10.5753/jidm.2026.5737

Keywords:

Medical History, Automatic Speech Recognition, Language Model, Text Style Transfer

Abstract

Automatic Speech Recognition (ASR) systems can reduce cognitive load and improve efficiency in medical documentation. This study evaluates Whisper and Wav2Vec2 PT for transcribing medical histories in Brazilian Portuguese. Using real audio-text pairs recorded by specialists and non-specialists, we compare model performance across speaker profiles. We explore decoding with n-gram language models and post-processing with a BERT-based classifier that corrects common spelling errors. We also apply large language models (LLMs) to text style transfer (TST), using prompt-based methods to convert transcriptions into structured medical anamneses. Results show that Whisper outperforms Wav2Vec2 PT overall. The BERT-based correction model improves transcription accuracy, especially when applied after normalization. Among the LLMs tested, Mistral produced the most consistent and structured outputs. These findings demonstrate the potential of combining ASR with language-model enhancements for medical documentation, while also highlighting open challenges in clinical ASR.
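The abstract does not state which accuracy metric or normalization rules were used. As a minimal sketch, assuming word error rate (WER) as the metric and a simple lowercase-and-strip-punctuation normalization, the effect of normalizing before scoring could look like the following. The functions `normalize` and `wer` and the example sentence are hypothetical illustrations, not the authors' code or data.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical normalization: lowercase, drop punctuation, collapse spaces."""
    # NFC keeps accented Portuguese letters as single code points so \w matches them.
    text = unicodedata.normalize("NFC", text.lower())
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


# Made-up sentence pair for illustration only (not from the study's dataset).
reference = "Paciente relata dor torácica há dois dias."
hypothesis = "paciente relata dor toracica há dois dias"

raw_score = wer(reference, hypothesis)
normalized_score = wer(normalize(reference), normalize(hypothesis))
print(f"raw WER: {raw_score:.3f}, normalized WER: {normalized_score:.3f}")
```

In this sketch, casing and punctuation differences count as errors in the raw score, while after normalization only the missing accent remains. That residual spelling error is the kind a downstream correction step would target.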

Published

2026-03-13

How to Cite

Gonçalves, Y. T., Alves, J. V. B., Sá, B. A. D., Macedo, J. A. F. de, & Silva, T. L. C. da. (2026). Enhancing Automatic Speech Recognition Medical Transcriptions. Journal of Information and Data Management, 17(1), 82–91. https://doi.org/10.5753/jidm.2026.5737

Issue

Vol. 17 No. 1 (2026)

Section

SBBD 2024 Full papers - Extended papers