Deep emotional text-to-speech and voice conversion
Emotional Speech Synthesis (ESS) is a rapidly evolving field, with significant advancements in both Emotional Text-to-Speech (ETTS) and Emotional Voice Conversion (EVC). These two research areas are integral to the development of ESS and target different application scenarios. This thesis examines the background, state-of-the-art studies and key concepts of ETTS and EVC, providing a comprehensive analysis of their respective methodologies and implementations.
In the ETTS domain, this work presents the design and experimentation of one neutral TTS system and two distinct ETTS systems. These systems are evaluated on various performance metrics to assess their capability in synthesising speech with emotional expression. The ETTS systems leverage transfer learning, highlighting its effectiveness in enhancing emotional expressivity in synthetic speech.
The EVC domain, in turn, is explored through both frame-to-frame and sequence-to-sequence approaches. Two frame-to-frame EVC systems are implemented, based on CycleGAN and VAE-GAN models. Both systems are tested and analysed through objective and subjective evaluations to determine their performance in converting neutral speech into emotional speech.
Additionally, to improve the quality of the converted speech, a sequence-to-sequence EVC system is developed, based on the Transformer architecture. The experimental results demonstrate the feasibility of this approach; however, the findings also point to the need for further optimisation to achieve more natural and high-quality output. Challenges such as training strategy, data augmentation and information disentanglement are addressed, offering insights for improvement.
This thesis concludes by outlining the general challenges in ESS, along with an outlook on future developments. Non-autoregressive models, flow-based TTS and diffusion-based TTS, as well as the integration of large models, are discussed as promising directions for improving ESS. These insights contribute to the ongoing efforts to bridge the gap between state-of-the-art studies and the ultimate goal of synthesising natural emotional speech.
Author: | Zijiang Yang |
---|---|
URN: | urn:nbn:de:bvb:384-opus4-1247424 |
Frontdoor URL: | https://opus.bibliothek.uni-augsburg.de/opus4/124742 |
Advisor: | Björn Schuller |
Type: | Doctoral Thesis |
Language: | English |
Date of Publication (online): | 2025/09/23 |
Year of first Publication: | 2025 |
Publishing Institution: | Universität Augsburg |
Granting Institution: | Universität Augsburg, Fakultät für Angewandte Informatik |
Date of final exam: | 2025/06/30 |
Release Date: | 2025/09/23 |
Tag: | Emotional Speech Synthesis, Affective Computing, Deep Learning, Artificial Intelligence |
GND-Keyword: | Automatische Sprachproduktion; Gefühl; Künstliche Intelligenz |
Page Number: | 193 |
Institutes: | Fakultät für Angewandte Informatik |
 | Fakultät für Angewandte Informatik / Institut für Informatik |
 | Fakultät für Angewandte Informatik / Institut für Informatik / Lehrstuhl für Embedded Intelligence for Health Care and Wellbeing |
Dewey Decimal Classification: | 0 Computer science, information and general works / 00 Computer science, knowledge and systems / 000 Computer science, information and general works |