Exploring deep spectrum representations via attention-based recurrent and convolutional neural networks for speech emotion recognition

Zhao, Ziping; Bao, Zhongtian; Zhao, Yiqin; Zhang, Zixing; Cummins, Nicholas; Ren, Zhao; Schuller, Björn

doi:10.1109/access.2019.2928625

The automatic detection of an emotional state from human speech, which plays a crucial role in the area of human-machine interaction, has consistently been shown to be a difficult task for machine learning algorithms. Previous work on emotion recognition has mostly focused on the extraction of carefully hand-crafted and highly engineered features. Results from these works have demonstrated the importance of discriminative spatio-temporal features for modelling the continual evolution of different emotions. Recently, spectrogram representations of emotional speech have achieved competitive performance for automatic speech emotion recognition (SER). How machine learning algorithms learn effective compositional spatio-temporal dynamics for SER has been a fundamental problem of deep representations, herein denoted as deep spectrum representations. In this paper, we develop a model to address this problem by leveraging a parallel combination of attention-based bidirectional long short-term memory recurrent neural networks with attention-based fully convolutional networks (FCN). Extensive experiments were undertaken on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database and the FAU Aibo Emotion Corpus (FAU-AEC) to highlight the effectiveness of our approach. The experimental results indicate that deep spectrum representations extracted from the proposed model are well-suited to the task of SER, achieving a weighted accuracy (WA) of 68.1% and an unweighted accuracy (UA) of 67.0% on IEMOCAP, and a UA of 45.4% on FAU-AEC. Key results indicate that the extracted deep representations, combined with a linear support vector classifier, are comparable in performance with eGeMAPS and ComParE, two standard acoustic feature representations.…
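The abstract reports results as WA and UA. As a rough illustration of how the two measures differ, here is a minimal sketch in Python. It assumes the definitions conventionally used in SER work (WA as overall accuracy across all utterances, UA as the mean of per-class recalls); this page does not spell them out, and the function names and toy labels below are hypothetical, not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct utterances divided by all utterances.

    Assumed definition of WA; frequent classes dominate this score.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally.

    Assumed definition of UA; it is less flattering on imbalanced data.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    totals = defaultdict(int)  # utterances per true class
    hits = defaultdict(int)    # correctly recognised utterances per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Toy, made-up labels: six "neutral" and two "angry" utterances,
# with one "angry" utterance misclassified as "neutral".
y_true = ["neutral"] * 6 + ["angry"] * 2
y_pred = ["neutral"] * 6 + ["neutral", "angry"]

print(weighted_accuracy(y_true, y_pred))    # 7 of 8 correct
print(unweighted_accuracy(y_true, y_pred))  # mean of recalls 1.0 and 0.5
```

On this toy data the overall accuracy is 7/8, while the per-class mean is pulled down by the minority class. Under these assumed definitions, that gap is why UA is often quoted alongside WA for imbalanced emotion corpora.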

Metadata
Author:	Ziping Zhao, Zhongtian Bao, Yiqin Zhao, Zixing Zhang, Nicholas Cummins, Zhao Ren, Björn Schuller
URN:	urn:nbn:de:bvb:384-opus4-611451
Frontdoor URL	https://opus.bibliothek.uni-augsburg.de/opus4/61145
ISSN:	2169-3536
Parent Title (English):	IEEE Access
Publisher:	Institute of Electrical and Electronics Engineers (IEEE)
Place of publication:	New York, NY
Type:	Article
Language:	English
Year of first Publication:	2019
Publishing Institution:	Universität Augsburg
Release Date:	2019/08/28
Tag:	General Computer Science; General Engineering; General Materials Science
Volume:	7
First Page:	97515
Last Page:	97525
DOI:	https://doi.org/10.1109/access.2019.2928625
Institutes:	Fakultät für Angewandte Informatik
	Fakultät für Angewandte Informatik / Institut für Informatik
	Fakultät für Angewandte Informatik / Institut für Informatik / Lehrstuhl für Embedded Intelligence for Health Care and Wellbeing
Dewey Decimal Classification:	0 Computer science, information & general works / 00 Computer science, knowledge & systems / 004 Data processing & computer science
Licence:	CC-BY 4.0: Creative Commons: Attribution

Open Access
