Evaluation of Automatic Speech Recognition Approaches
DOI: https://doi.org/10.5753/jidm.2022.2514
Keywords: automatic speech recognition, speech translation, speech to text
Abstract
Automatic Speech Recognition (ASR) is essential for many applications, such as automatic caption generation for videos, voice search, voice commands for smart homes, and chatbots. Given the increasing popularity of these applications and the advances in deep learning models for transcribing speech into text, this work evaluates the performance of commercial ASR solutions that use deep learning models, namely Facebook Wit.ai, Microsoft Azure Speech, Google Cloud Speech-to-Text, Wav2Vec, and AWS Transcribe. We performed the experiments on two real, public datasets, Mozilla Common Voice and Voxforge. The results show that the evaluated solutions differ only slightly; nevertheless, Facebook Wit.ai outperforms the other analyzed approaches on the quality metrics collected, namely WER, BLEU, and METEOR. We also fine-tune the Jasper neural network for ASR on four additional datasets that have no intersection with those used to collect the quality metrics, and we study the performance of the Jasper model on the two public datasets, comparing its results with those of the other pre-trained models.
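As a rough illustration of how such transcript quality metrics can be collected, the sketch below scores a single ASR hypothesis against its reference transcript with WER, BLEU, and METEOR. It is a minimal sketch, not the evaluation pipeline used in the paper: the example sentences are invented, and the jiwer/NLTK libraries are one possible (assumed) choice for computing these metrics.

# Minimal sketch (assumed libraries: `pip install jiwer nltk`), not the authors' pipeline.
import jiwer
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

# METEOR relies on WordNet data shipped with NLTK.
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

reference = "turn on the living room lights"   # ground-truth transcript (illustrative)
hypothesis = "turn on the living room light"   # ASR output (illustrative)

# Word Error Rate: (substitutions + deletions + insertions) / reference word count.
wer = jiwer.wer(reference, hypothesis)

# BLEU and METEOR operate on token lists; smoothing avoids zero BLEU on short sentences.
ref_tokens, hyp_tokens = reference.split(), hypothesis.split()
bleu = sentence_bleu([ref_tokens], hyp_tokens,
                     smoothing_function=SmoothingFunction().method1)
meteor = meteor_score([ref_tokens], hyp_tokens)

print(f"WER: {wer:.3f}  BLEU: {bleu:.3f}  METEOR: {meteor:.3f}")

In practice, the scores would be averaged over every utterance in a test set (e.g., Common Voice or Voxforge) for each ASR system under comparison.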
References
Amazon Transcribe Site. Amazon Transcribe. [link], 2021. [Online; accessed 11-January-2021].
Baevski, A., Zhou, Y., Mohamed, A., and Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020. Curran Associates, Inc., Virtual-only Conference, pp. 12449–12460, 2020.
Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, Conference Track Proceedings. ICLR, San Diego, CA, USA, 2015.
Banerjee, S. and Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Association for Computational Linguistics, Ann Arbor, Michigan, pp. 65–72, 2005.
Chan, W., Jaitly, N., Le, Q., and Vinyals, O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Shanghai, China, pp. 4960–4964, 2016.
Chiu, C.-C., Sainath, T. N., Wu, Y., Prabhavalkar, R., Nguyen, P., Chen, Z., Kannan, A., Weiss, R. J., Rao, K., Gonina, E., et al. State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Calgary, Alberta, Canada, pp. 4774–4778, 2018.
Chiu, J. P. and Nichols, E. Named entity recognition with bidirectional lstm-cnns. Transactions of the Association for Computational Linguistics vol. 4, pp. 357–370, 2016.
Chollet, F. Deep learning with Python. Manning Publications, Shelter Island, NY, 2021.
CS231n: Convolutional Neural Networks for Visual Recognition. Convolutional neural networks for visual recognition. [link], 2022. [Online; accessed 12-January-2022].
de Lima, T. A. and Da Costa-Abreu, M. A survey on automatic speech recognition systems for portuguese language and its variations. Computer Speech & Language vol. 62, pp. 101055, 2020.
Demšar, J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research vol. 7, pp. 1–30, 2006.
Dernoncourt, F., Bui, T., and Chang, W. A framework for speech recognition benchmarking. In Proc. Interspeech 2018. ISCA, Hyderabad, pp. 169–170, 2018.
Filippidou, F. and Moussiades, L. A benchmarking of ibm, google and wit automatic speech recognition systems. In IFIP International Conference on Artificial Intelligence Applications and Innovations. Springer International Publishing, Cham, pp. 73–82, 2020.
Graves, A., Mohamed, A.-r., and Hinton, G. Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, Vancouver, BC, Canada, pp. 6645–6649, 2013.
Hartmann, N. S., Fonseca, E. R., Shulby, C. D., Treviso, M. V., Rodrigues, J. S., and Aluísio, S. M. Portuguese word embeddings: Evaluating on word analogies and natural language tasks. In Anais do XI Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana. SBC, Porto Alegre, RS, Brasil, pp. 122–131, 2017.
Kahn, J., Rivière, M., Zheng, W., Kharitonov, E., Xu, Q., Mazare, P.-E., Karadayi, J., Liptchinsky, V., Collobert, R., Fuegen, C., et al. Libri-light: A benchmark for asr with limited or no supervision. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Barcelona, Spain, pp. 7669–7673, 2020.
Karita, S., Chen, N., Hayashi, T., Hori, T., Inaguma, H., Jiang, Z., Someki, M., Soplin, N. E. Y., Yamamoto, R., Wang, X., et al. A comparative study on transformer vs rnn in speech applications. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, Singapore, pp. 449–456, 2019.
Këpuska, V. and Bohouta, G. Comparing speech recognition systems (microsoft api, google api and cmu sphinx). Int. J. Eng. Res. Appl. 7 (3): 20–24, 2017.
Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J. M., Nguyen, H., and Gadde, R. T. Jasper: An End-to-End Convolutional Neural Acoustic Model. In Proc. Interspeech 2019. ISCA, Graz, Austria, pp. 71–75, 2019.
Likhomanenko, T., Xu, Q., Pratap, V., Tomasello, P., Kahn, J., Avidov, G., Collobert, R., and Synnaeve, G. Rethinking evaluation in ASR: are our models robust enough? CoRR vol. abs/2010.11745, 2020.
Mitrevski, M. Getting started with wit.ai. In Developing Conversational Interfaces for iOS: Add Responsive Voice Control to Your Apps. Apress, Berkeley, CA, pp. 143–164, 2018.
MSc program in Artificial Intelligence of the University of Amsterdam. Deep Learning Tutorials. [link], 2021. [Online; accessed 12-January-2021].
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. ACL ’02. Association for Computational Linguistics, USA, pp. 311–318, 2002.
Reddy, D. R. Speech recognition by machine: A review. Proceedings of the IEEE 64 (4): 501–531, 1976.
Sampaio, M., Magalhães, R., Silva, T., Cruz, L., Vasconcelos, D., Macêdo, J., and Ferreira, M. Evaluation of automatic speech recognition systems. In Anais do XXXVI Simpósio Brasileiro de Bancos de Dados. SBC, Porto Alegre, RS, Brasil, pp. 301–306, 2021.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS'17). Curran Associates Inc., Long Beach, CA, USA, pp. 6000–6010, 2017.
Wilcoxon, F. Individual comparisons by ranking methods. In Breakthroughs in Statistics: Methodology and Distribution, S. Kotz and N. L. Johnson (Eds.). Springer New York, New York, NY, pp. 196–202, 1992.
Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M. L., Stolcke, A., Yu, D., and Zweig, G. Toward human parity in conversational speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (12): 2410–2423, 2017.
Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., and Stolcke, A. The microsoft 2017 conversational speech recognition system. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, New Orleans, USA, pp. 5934–5938, 2018.