
On the impact of word error rate on acoustic-linguistic speech emotion recognition: an update for the deep learning era

Text encodings from automatic speech recognition (ASR) transcripts and audio representations have shown promise in speech emotion recognition (SER) ever since. Yet, it is challenging to explain the effect of each information stream on the SER systems. Further, more clarification is required for analysing the impact of ASR's word error rate (WER) on linguistic emotion recognition per se and in the context of fusion with acoustic information exploitation in the age of deep ASR systems. In order to tackle the above issues, we create transcripts from the original speech by applying three modern ASR systems, including an end-to-end model trained with recurrent neural network-transducer loss, a model with connectionist temporal classification loss, and a wav2vec framework for self-supervised learning. Afterwards, we use pre-trained textual models to extract text representations from the ASR outputs and the gold standard. For extraction and learning of acoustic speech features, we utilise openSMILE, openXBoW, DeepSpectrum, and auDeep. Finally, we conduct decision-level fusion on both information streams -- acoustics and linguistics. Using the best development configuration, we achieve state-of-the-art unweighted average recall values of 73.6% and 73.8% on the speaker-independent development and test partitions of IEMOCAP, respectively.
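The abstract describes a two-stream pipeline: ASR transcripts feed a pre-trained text model, acoustic features come from dedicated toolkits, and the two streams are combined by decision-level (late) fusion. The sketch below is a minimal illustration of that general idea, not the authors' implementation; the wav2vec 2.0 and BERT checkpoints, the audio file name, the dummy classifier posteriors, and the fusion weight are all assumptions made for the example.

# Illustrative sketch only: transcribe speech with a wav2vec 2.0 CTC model,
# embed the transcript with a pre-trained text model, and combine acoustic
# and linguistic emotion predictions by decision-level fusion.
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC, AutoTokenizer, AutoModel

# 1) ASR stream: produce a transcript from the raw waveform (16 kHz mono assumed).
asr_proc = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
asr_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform, sr = torchaudio.load("utterance.wav")  # hypothetical input file
waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)

inputs = asr_proc(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = asr_model(inputs.input_values).logits
transcript = asr_proc.batch_decode(torch.argmax(logits, dim=-1))[0]

# 2) Linguistic stream: utterance-level embedding from a pre-trained text model.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
txt_model = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    hidden = txt_model(**tok(transcript, return_tensors="pt")).last_hidden_state
text_embedding = hidden.mean(dim=1)  # shape (1, 768); input to a text classifier

# 3) Decision-level fusion: each stream has its own emotion classifier
#    (trained separately, not shown); their class posteriors are averaged.
def late_fusion(p_acoustic: torch.Tensor, p_linguistic: torch.Tensor,
                weight: float = 0.5) -> torch.Tensor:
    """Weighted average of per-class probabilities from the two streams."""
    return weight * p_acoustic + (1.0 - weight) * p_linguistic

# Example with dummy posteriors over four emotion classes.
p_ac = torch.tensor([[0.10, 0.60, 0.20, 0.10]])
p_li = torch.tensor([[0.05, 0.40, 0.45, 0.10]])
predicted_class = late_fusion(p_ac, p_li).argmax(dim=-1)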

Metadata
Author:Shahin Amiriparian, Artem Sokolov, Ilhan Aslan, Lukas Christ, Maurice Gerczuk, Tobias Hübner, Dimitry Lamanov, Manuel Milling, Sandra Ottl, Ilya Poduremennykh, Evgeniy Shuranov, Björn W. Schuller
URN:urn:nbn:de:bvb:384-opus4-916029
Frontdoor URL:https://opus.bibliothek.uni-augsburg.de/opus4/91602
Parent Title (English):arXiv
Publisher:arXiv
Type:Preprint
Language:English
Date of Publication (online):2022/01/03
Year of first Publication:2021
Publishing Institution:Universität Augsburg
Release Date:2022/01/28
arXiv Identifier:arXiv:2104.10121v1
DOI:https://doi.org/10.48550/arXiv.2104.10121
Institutes:Fakultät für Angewandte Informatik
Fakultät für Angewandte Informatik / Institut für Informatik
Fakultät für Angewandte Informatik / Institut für Informatik / Lehrstuhl für Embedded Intelligence for Health Care and Wellbeing
Dewey Decimal Classification:0 Computer science, information & general works / 00 Computer science, knowledge & systems / 004 Data processing; computer science
Licence:CC-BY-NC-SA 4.0: Creative Commons: Attribution - NonCommercial - ShareAlike (with Print on Demand)