Probing speech emotion recognition transformers for linguistic knowledge

  • Large, pre-trained neural networks consisting of self-attention layers (transformers) have recently achieved state-of-the-art results on several speech emotion recognition (SER) datasets. These models are typically pre-trained in self-supervised manner with the goal to improve automatic speech recognition performance -- and thus, to understand linguistic information. In this work, we investigate the extent in which this information is exploited during SER fine-tuning. Using a reproducible methodology based on open-source tools, we synthesise prosodically neutral speech utterances while varying the sentiment of the text. Valence predictions of the transformer model are very reactive to positive and negative sentiment content, as well as negations, but not to intensifiers or reducers, while none of those linguistic features impact arousal or dominance. These findings show that transformers can successfully leverage linguistic information to improve their valence predictions, and thatLarge, pre-trained neural networks consisting of self-attention layers (transformers) have recently achieved state-of-the-art results on several speech emotion recognition (SER) datasets. These models are typically pre-trained in self-supervised manner with the goal to improve automatic speech recognition performance -- and thus, to understand linguistic information. In this work, we investigate the extent in which this information is exploited during SER fine-tuning. Using a reproducible methodology based on open-source tools, we synthesise prosodically neutral speech utterances while varying the sentiment of the text. Valence predictions of the transformer model are very reactive to positive and negative sentiment content, as well as negations, but not to intensifiers or reducers, while none of those linguistic features impact arousal or dominance. These findings show that transformers can successfully leverage linguistic information to improve their valence predictions, and that linguistic analysis should be included in their testing.show moreshow less

Download full text files

Export metadata

Statistics

Number of document requests

Additional Services

Share in Twitter Search Google Scholar
Metadaten
Author:Andreas TriantafyllopoulosORCiD, Johannes Wagner, Hagen Wierstorf, Maximilian SchmittORCiD, Uwe Reichel, Florian Eyben, Felix Burkhardt, Björn W. SchullerORCiDGND
URN:urn:nbn:de:bvb:384-opus4-992908
Frontdoor URLhttps://opus.bibliothek.uni-augsburg.de/opus4/99290
Parent Title (English):Interspeech 2022, Incheon, Korea, 18-22 September 2022
Publisher:ISCA
Place of publication:Baixas
Editor:Hanseok Ko, John H. L. Hansen
Type:Conference Proceeding
Language:English
Year of first Publication:2022
Publishing Institution:Universität Augsburg
Release Date:2022/11/15
First Page:146
Last Page:150
DOI:https://doi.org/10.21437/interspeech.2022-10371
Institutes:Fakultät für Angewandte Informatik
Fakultät für Angewandte Informatik / Institut für Informatik
Fakultät für Angewandte Informatik / Institut für Informatik / Lehrstuhl für Embedded Intelligence for Health Care and Wellbeing
Dewey Decimal Classification:0 Informatik, Informationswissenschaft, allgemeine Werke / 00 Informatik, Wissen, Systeme / 004 Datenverarbeitung; Informatik
Licence (German):Deutsches Urheberrecht