Multistage linguistic conditioning of convolutional layers for speech emotion recognition

Introduction: The effective fusion of text and audio information for categorical and dimensional speech emotion recognition (SER) remains an open issue, especially given the vast potential of deep neural networks (DNNs) to provide a tighter integration of the two.
Methods: In this contribution, we investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional SER. We propose a novel, multistage fusion method where the two information streams are integrated in several layers of a DNN, and contrast it with a single-stage one where the streams are merged in a single point. Both methods depend on extracting summary linguistic embeddings from a pre-trained BERT model, and conditioning one or more intermediate representations of a convolutional model operating on log-Mel spectrograms.
Results: Experiments on the MSP-Podcast and IEMOCAP datasets demonstrate that the two fusion methods clearly outperform a shallow (late) fusion baseline and their unimodal constituents, both in terms of quantitative performance and qualitative behavior.
Discussion: Overall, our multistage fusion shows better quantitative performance, surpassing alternatives on most of our evaluations. This illustrates the potential of multistage fusion in better assimilating text and audio information.
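As a rough illustration of the conditioning idea described in the abstract, the following PyTorch sketch conditions each stage of a small log-Mel CNN on a pooled BERT embedding via a learned per-channel bias. This is not the authors' implementation: the layer sizes, the number of stages, and the additive conditioning mechanism are assumptions made only for illustration.

# Minimal sketch (not the paper's exact architecture): a log-Mel CNN whose
# intermediate feature maps are conditioned at several stages on a summary
# linguistic embedding, e.g. the pooled output of a pre-trained BERT model.
# Layer sizes, stage count, and additive conditioning are illustrative assumptions.
import torch
import torch.nn as nn


class ConditionedConvBlock(nn.Module):
    """Conv block whose output is shifted by a projection of the text embedding."""

    def __init__(self, in_ch: int, out_ch: int, text_dim: int = 768):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Project the BERT summary embedding to a per-channel bias.
        self.cond = nn.Linear(text_dim, out_ch)

    def forward(self, x: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        h = self.conv(x)
        # Broadcast the conditioning vector over the time and frequency axes.
        return h + self.cond(text_emb)[:, :, None, None]


class MultistageFusionSER(nn.Module):
    """CNN over log-Mel spectrograms, conditioned on text at every stage."""

    def __init__(self, n_classes: int = 4, text_dim: int = 768):
        super().__init__()
        chans = [1, 32, 64, 128]
        self.blocks = nn.ModuleList(
            ConditionedConvBlock(c_in, c_out, text_dim)
            for c_in, c_out in zip(chans[:-1], chans[1:])
        )
        self.head = nn.Linear(chans[-1], n_classes)

    def forward(self, log_mel: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # log_mel: (batch, 1, n_mels, frames); text_emb: (batch, text_dim)
        h = log_mel
        for block in self.blocks:
            h = block(h, text_emb)        # fuse text information at every stage
        h = h.mean(dim=(2, 3))            # global average pooling
        return self.head(h)

A single-stage variant, by contrast, would inject the text embedding only once, e.g. by concatenating it with the pooled audio features before the classification head.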

Metadata
Author:Andreas Triantafyllopoulos, Uwe Reichel, Shuo Liu, Stephan Huber, Florian Eyben, Björn W. Schuller
URN:urn:nbn:de:bvb:384-opus4-915467
Frontdoor URL:https://opus.bibliothek.uni-augsburg.de/opus4/91546
Parent Title (English):Frontiers in Computer Science
Publisher:Frontiers Media S.A.
Place of publication:Lausanne
Type:Article
Language:English
Year of first Publication:2023
Publishing Institution:Universität Augsburg
Release Date:2022/01/28
Volume:5
First Page:1072479
DOI:https://doi.org/10.3389/fcomp.2023.1072479
Institutes:Fakultät für Angewandte Informatik
Fakultät für Angewandte Informatik / Institut für Informatik
Fakultät für Angewandte Informatik / Institut für Informatik / Lehrstuhl für Embedded Intelligence for Health Care and Wellbeing
Dewey Decimal Classification:0 Computer science, information & general works / 00 Computer science, knowledge & systems / 004 Data processing; computer science
Licence:CC-BY 4.0: Creative Commons: Attribution (with Print on Demand)