The evaluation of prosody in speech synthesis: a systematic review

Authors

Galdino, J. C., Matos, A. N., Svartman, F. R. F., & Aluisio, S. M.

DOI:

https://doi.org/10.5753/jbcs.2025.5468

Keywords:

Speech Synthesis, TTS, Prosody, Objective evaluation, Subjective evaluation

Abstract

This paper presents a systematic review of the relationship between prosody and speech synthesis, focusing on the evaluation of the prosodic parameters of synthesized speech. The topic is relevant because the task of speech synthesis has not yet been solved, so the findings of this review can contribute to knowledge in the area and to improving the methodologies used to evaluate the prosody of synthesized speech. To select studies, we used the Parsifal platform, including 100 studies published between 2020 and 2024, with the purpose of answering eight previously established research questions. The highlights of this systematic review are as follows. The main prosodic parameters considered in speech synthesis systems are fundamental frequency (F0), duration, and intensity, with F0 standing out (95 studies). The metric most frequently used in the studies belongs to the group of acoustic metrics: F0-RMSE, the root mean squared error of F0. Lower values of this metric indicate greater proximity between the F0 of synthesized speech and that of natural speech. The most used dataset was LJ Speech, a public-domain speech dataset consisting of English audio clips of a single speaker reading short excerpts from seven non-fiction books, reinforcing that the predominant language was English: 48 studies evaluate the prosody of synthesized speech in English, although there is a relevant number of studies in Mandarin Chinese (27 studies) and Japanese (15 studies). Most studies used existing models as baselines to compare the performance of their methods, or proposed new models to improve the prosody of synthesized speech. Each study presented different methods for this improvement, according to its objectives, such as learning prosodic features extracted from reference speech and adding auxiliary modules to existing model architectures. Among the baselines, two were recurrent: Tacotron 2, which generates mel-spectrograms from text and then synthesizes speech from them using a separately trained vocoder, and FastSpeech 2, which can extract explicit prosodic features to be used directly as input during training.
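
To make the headline metric concrete, the sketch below illustrates one common way to compute F0-RMSE, i.e. sqrt(mean((F0_nat - F0_syn)^2)) over voiced frames, using librosa's pYIN pitch tracker. This is a minimal illustrative sketch, not the evaluation code of the reviewed studies: it assumes librosa and numpy are available, that the natural and synthesized recordings are already time-aligned (in practice, frames are often aligned with dynamic time warping first), and the file names in the usage example are hypothetical.

    # Illustrative F0-RMSE computation; assumes the two recordings are
    # already frame-aligned (real evaluations often apply DTW first).
    import librosa
    import numpy as np

    def extract_f0(path, sr=22050):
        # pYIN returns an F0 contour in Hz, with NaN in unvoiced frames.
        y, _ = librosa.load(path, sr=sr)
        f0, _, _ = librosa.pyin(y, fmin=65.0, fmax=2093.0, sr=sr)
        return f0

    def f0_rmse(natural_wav, synthesized_wav):
        f0_nat = extract_f0(natural_wav)
        f0_syn = extract_f0(synthesized_wav)
        n = min(len(f0_nat), len(f0_syn))  # truncate to common length
        f0_nat, f0_syn = f0_nat[:n], f0_syn[:n]
        # Compare only frames that are voiced in both signals.
        voiced = ~np.isnan(f0_nat) & ~np.isnan(f0_syn)
        # Lower F0-RMSE (in Hz) means the synthesized contour is closer
        # to the natural one.
        return float(np.sqrt(np.mean((f0_nat[voiced] - f0_syn[voiced]) ** 2)))

    # Hypothetical usage:
    # print(f0_rmse("natural.wav", "synthesized.wav"))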

Published

2025-07-09

How to Cite

Galdino, J. C., Matos, A. N., Svartman, F. R. F., & Aluisio, S. M. (2025). The evaluation of prosody in speech synthesis: a systematic review. Journal of the Brazilian Computer Society, 31(1), 466–487. https://doi.org/10.5753/jbcs.2025.5468

Issue

Vol. 31 No. 1 (2025)

Section

Articles