Perception and classification of emotions in nonsense speech: humans versus machines

  • This article contributes to a more adequate modelling of emotions encoded in speech by addressing four fallacies prevalent in traditional affective computing: First, studies concentrate on a few emotions and disregard all others (‘closed world’). Second, studies use clean (lab) data or real-life data but do not compare clean and noisy data in a comparable setting (‘clean world’). Third, machine learning approaches need large amounts of data; however, their performance has not yet been assessed by systematically comparing different approaches and different sizes of databases (‘small world’). Fourth, although human annotations of emotion constitute the basis for automatic classification, human perception and machine classification have not yet been compared on a strict basis (‘one world’). Finally, we deal with the intrinsic ambiguities of emotions by interpreting the confusions between categories (‘fuzzy world’).
We use acted nonsense speech from the GEMEP corpus, emotional ‘distractors’ as categories not entailed in the test set, real-life noises that mask the clean recordings, and different sizes of the training set for machine learning. We show that machine learning based on state-of-the-art feature representations (wav2vec2) is able to mirror the main emotional categories (‘pillars’) present in perceptual emotional constellations even in degraded acoustic conditions.

Metadata
Author:Emilia Parada-Cabaleiro, Anton Batliner, Maximilian Schmitt, Markus Schedl, Giovanni Costantini, Björn Schuller
URN:urn:nbn:de:bvb:384-opus4-1034199
Frontdoor URL:https://opus.bibliothek.uni-augsburg.de/opus4/103419
ISSN:1932-6203
Parent Title (English):PLoS ONE
Publisher:Public Library of Science (PLoS)
Place of publication:San Francisco, CA
Type:Article
Language:English
Year of first Publication:2023
Publishing Institution:Universität Augsburg
Release Date:2023/04/07
Volume:18
Issue:1
First Page:e0281079
DOI:https://doi.org/10.1371/journal.pone.0281079
Institutes:Fakultät für Angewandte Informatik
Fakultät für Angewandte Informatik / Institut für Informatik
Fakultät für Angewandte Informatik / Institut für Informatik / Lehrstuhl für Embedded Intelligence for Health Care and Wellbeing
Dewey Decimal Classification:0 Computer science, information, general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science
Licence:CC-BY 4.0: Creative Commons: Attribution (with Print on Demand)