Noise-Robust Automatic Speech Recognition: A Case Study for Communication Interference

Authors

J. C. Duarte; S. Colcher

DOI:

https://doi.org/10.5753/jis.2024.4267

Keywords:

Automatic Speech Recognition Systems, Noise Robustness, Portuguese ASRs

Abstract

An Automatic Speech Recognition (ASR) system is a software tool that converts a speech audio waveform into its corresponding text transcription. ASR systems are usually built with Artificial Intelligence techniques, particularly Machine Learning algorithms such as Deep Learning, to address the multi-faceted complexity and variability of human speech. This allows these systems to learn from extensive speech datasets, adapt to several languages and accents, and continuously improve their performance over time, making them increasingly versatile and effective at transcribing spoken language to text. In much the same way, we argue that the noises commonly present in different environments also need to be dealt with explicitly and, when possible, modeled within specific datasets with proper training. Our motivation comes from the observation that noise removal techniques (commonly called denoising) are not always fully (and generically) effective. For instance, noise degradation due to communication interference, which is almost always present in radio transmissions, has peculiarities that a simple mathematical formulation cannot model. This work presents a modeling technique composed of an augmented dataset-building approach and a profile identifier that can be used to build ASRs for noisy environments that perform similarly to those used in noise-free environments. As a case study, we developed an ASR tailored to the interference noise in radio transmissions, along with its specific dataset, and compared our results with other state-of-the-art work. We report a Character Error Rate of 0.3163 for the developed ASR under several different noise conditions.
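To make the abstract's two key quantities concrete, the sketch below (Python, using only numpy) illustrates, under our own assumptions rather than the paper's actual pipeline: (i) building a noise-augmented training example by mixing an interference recording into clean speech at a target signal-to-noise ratio, the general idea behind augmented dataset-building; and (ii) computing the Character Error Rate (CER) as the Levenshtein edit distance between a reference and a hypothesis transcript, normalized by reference length. The function names mix_at_snr and cer are illustrative, not taken from the paper.

    import numpy as np

    def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
        """Mix noise into clean speech at a target signal-to-noise ratio (dB).

        Both inputs are 1-D float arrays at the same sample rate; the noise
        is tiled or truncated to match the length of the speech signal.
        """
        if len(noise) < len(clean):
            noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
        noise = noise[:len(clean)]
        speech_power = np.mean(clean ** 2)
        noise_power = np.mean(noise ** 2)
        # Scale noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
        scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
        return clean + scale * noise

    def cer(reference: str, hypothesis: str) -> float:
        """Character Error Rate: Levenshtein edit distance / reference length."""
        m, n = len(reference), len(hypothesis)
        dist = np.zeros((m + 1, n + 1), dtype=int)
        dist[:, 0] = np.arange(m + 1)  # all-deletions baseline
        dist[0, :] = np.arange(n + 1)  # all-insertions baseline
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
                dist[i, j] = min(dist[i - 1, j] + 1,        # deletion
                                 dist[i, j - 1] + 1,        # insertion
                                 dist[i - 1, j - 1] + sub)  # substitution
        return dist[m, n] / max(m, 1)

For instance, cer("transcript", "transkript") yields 0.1 (one substitution over a ten-character reference), so the reported CER of 0.3163 corresponds to roughly one character edit for every three reference characters.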

Published

2024-07-09

How to Cite

DUARTE, J. C.; COLCHER, S. Noise-Robust Automatic Speech Recognition: A Case Study for Communication Interference. Journal on Interactive Systems, Porto Alegre, RS, v. 15, n. 1, p. 670–681, 2024. DOI: 10.5753/jis.2024.4267. Available at: https://journals-sol.sbc.org.br/index.php/jis/article/view/4267. Accessed: 18 Oct. 2024.

Issue

Vol. 15 No. 1 (2024)

Section

Regular Paper
