Deep emotional text-to-speech and voice conversion

Emotional Speech Synthesis (ESS) is a rapidly evolving field, with significant advancements in both Emotional Text-to-Speech (ETTS) and Emotional Voice Conversion (EVC). These two research areas are integral to the development of ESS and target different application scenarios. This thesis examines the background, state-of-the-art studies and key concepts of ETTS and EVC, providing a comprehensive analysis of their respective methodologies and implementations. In the ETTS domain, this work presents the design of, and experiments with, one neutral TTS system and two distinct ETTS systems. These systems are evaluated on various performance metrics to assess their capability in synthesising speech with emotional expression. The ETTS systems leverage transfer learning, highlighting its effectiveness in enhancing emotional expressivity in synthetic speech. The EVC domain, in turn, is explored through both frame-to-frame and sequence-to-sequence approaches. Two frame-to-frame EVC systems are implemented, based on CycleGAN and VAE-GAN models. Both systems are tested and analysed through objective and subjective evaluations to determine their performance in converting neutral speech into emotional speech. Additionally, in order to optimise the quality of the converted speech, a first sequence-to-sequence EVC system is developed, based on the Transformer architecture. The experimental results demonstrate the feasibility of this approach; however, the findings also indicate the need for further optimisation to achieve more natural, higher-quality output. Challenges such as training strategy, data augmentation and information disentanglement are addressed, offering insights for improvement. The thesis concludes by outlining the general challenges in ESS, along with an outlook on future developments. Non-autoregressive models, flow-based TTS and diffusion-based TTS, as well as the integration of large models, are discussed as promising directions for improving ESS. These insights contribute to the ongoing efforts to bridge the gap between state-of-the-art studies and the ultimate goal of synthesising natural emotional speech.

Metadata
Author:Zijiang Yang
URN:urn:nbn:de:bvb:384-opus4-1247424
Frontdoor URL:https://opus.bibliothek.uni-augsburg.de/opus4/124742
Advisor:Björn Schuller
Type:Doctoral Thesis
Language:English
Date of Publication (online):2025/09/23
Year of first Publication:2025
Publishing Institution:Universität Augsburg
Granting Institution:Universität Augsburg, Fakultät für Angewandte Informatik
Date of final exam:2025/06/30
Release Date:2025/09/23
Tag:Emotional Speech Synthesis, Affective Computing, Deep Learning, Artificial Intelligence
GND-Keyword:Automatische Sprachproduktion (automatic speech production); Gefühl (emotion); Künstliche Intelligenz (artificial intelligence)
Page Number:193
Institutes:Fakultät für Angewandte Informatik
Fakultät für Angewandte Informatik / Institut für Informatik
Fakultät für Angewandte Informatik / Institut für Informatik / Lehrstuhl für Embedded Intelligence for Health Care and Wellbeing
Dewey Decimal Classification:0 Computer science, information & general works / 00 Computer science, knowledge & systems / 000 Computer science, information & general works
Licence (German):CC-BY-NC 4.0: Creative Commons: Attribution - NonCommercial (with Print on Demand)