Speech synthesis with mixed emotions

Zhou, Kun; Sisman, Berrak; Rana, Rajib; Schuller, Björn W.; Li, Haizhou

doi:10.1109/taffc.2022.3233324

Emotional speech synthesis aims to synthesize human voices with various emotional effects. Current studies mostly focus on imitating an averaged style belonging to a specific emotion type. In this paper, we seek to generate speech with a mixture of emotions at run-time. We propose a novel formulation that measures the relative difference between the speech samples of different emotions. We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework. During training, the framework not only explicitly characterizes emotion styles but also explores the ordinal nature of emotions by quantifying the differences with other emotions. At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector. The objective and subjective evaluations have validated the effectiveness of the proposed framework. To the best of our knowledge, this research is the first study on modelling, synthesizing, and evaluating mixed emotions in speech.
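
As a rough illustration of the run-time control described in the abstract, the sketch below builds a manually defined emotion attribute vector from per-emotion intensities. The emotion inventory, the [0, 1] intensity range, and the helper name `make_attribute_vector` are assumptions made for this example; the abstract does not specify them, and this is not the authors' implementation.

```python
# Hypothetical emotion inventory; the abstract does not list the emotion set.
EMOTIONS = ("happy", "sad", "angry", "surprise")


def make_attribute_vector(mixture, emotions=EMOTIONS):
    """Return per-emotion intensities ordered as `emotions`.

    `mixture` maps an emotion name to an intensity. The [0, 1] range is an
    assumption of this sketch; emotions not mentioned default to 0.0.
    """
    unknown = set(mixture) - set(emotions)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")

    vector = []
    for name in emotions:
        value = float(mixture.get(name, 0.0))
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"intensity for {name!r} must lie in [0, 1]")
        vector.append(value)
    return vector


# A mostly happy utterance with a trace of sadness.
print(make_attribute_vector({"happy": 0.8, "sad": 0.3}))
```

In a system of the kind the abstract outlines, a vector like this would be passed to the trained text-to-speech model as a conditioning input; how the model consumes it is outside the scope of this sketch.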

Metadata
Author:	Kun Zhou, Berrak Sisman, Rajib Rana, Björn W. Schuller, Haizhou Li
URN:	urn:nbn:de:bvb:384-opus4-1120806
Frontdoor URL:	https://opus.bibliothek.uni-augsburg.de/opus4/112080
ISSN:	1949-3045
ISSN:	2371-9850
Parent Title (English):	IEEE Transactions on Affective Computing
Publisher:	Institute of Electrical and Electronics Engineers (IEEE)
Type:	Article
Language:	English
Year of first Publication:	2023
Publishing Institution:	Universität Augsburg
Release Date:	2024/03/19
Tag:	Human-Computer Interaction; Software
Volume:	14
Issue:	4
First Page:	3120
Last Page:	3134
DOI:	https://doi.org/10.1109/taffc.2022.3233324
Institutes:	Fakultät für Angewandte Informatik
	Fakultät für Angewandte Informatik / Institut für Informatik
	Fakultät für Angewandte Informatik / Institut für Informatik / Lehrstuhl für Embedded Intelligence for Health Care and Wellbeing
Dewey Decimal Classification:	0 Computer science, information, general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science
Licence:	CC-BY 4.0: Creative Commons: Attribution

Open Access
