Characterizing emotional prosody using human-in-the-loop algorithms

van Rijn, Pol

Speech conveys rich information beyond the spoken content, including inferential cues about the speaker’s intentions, personality, conversational goals, and emotions. This information is conveyed through prosody, characterized by variations in pitch, loudness, timing, and voice quality. Emotional prosody, in particular, concerns how people speak when they are expressing emotions. Communicating emotions is crucial for successful human-computer and human-robot interaction, which requires large datasets of emotional speech. In this thesis, we identify three core methodological problems in creating such corpora and propose solutions to them: obtaining a representative sample of all emotional prosodies (stimulus selection problem), identifying appropriate emotion annotations (taxonomy curation problem), and aligning emotional concepts across languages (lost-in-translation problem). This thesis consists of three parts.
In the first part, we develop Human-In-The-Loop (HITL) algorithms that provide solutions to the identified problems in emotional prosody. While corpora only indirectly capture the association between prosody (stimulus space) and emotions (semantic space), the actual association is stored in the minds of humans. HITL algorithms can sample this information directly from humans by incorporating human decisions into computer algorithms. In particular, sampling algorithms from machine learning are used to iteratively characterize high-dimensional probability distributions. Here, we incorporate humans as part of the iterative procedure to obtain representative and diverse samples of stimuli over a distribution of latent concepts in human minds, such as the joint distribution of prosodic features and emotions. Concretely, we propose three HITL algorithms: (i) Gibbs Sampling with People (GSP) to efficiently find instances of prosody that sound like a particular emotion using a voice model, (ii) Genetic Algorithm with People (GAP) to obtain a diverse set of emotional recordings through the process of mutation and selection, and (iii) Sequential Transmission Evaluation Pipeline (STEP) to distill a taxonomy of emotions from prosody. While the first two algorithms provide solutions to the stimulus selection problem, the last algorithm provides a solution to the taxonomy curation problem. In the second part of the thesis, we establish an infrastructure to run massive online experiments across the globe. This infrastructure allows deploying the algorithms across languages, providing a solution to the lost-in-translation problem. We benchmark the created infrastructure by running a large-scale, cross-lingual experiment in a low-dimensional and well-studied domain. We recognize that the three problems identified for emotional prosody are pervasive and exist for most machine learning datasets.
For example, when constructing a corpus for object recognition, one has to select a representative sample of objects, decide on a taxonomy to label the objects, and, for multilingual datasets, decide how to align those taxonomies. In the last part of the thesis, we demonstrate that these HITL algorithms, which have been developed to solve core scientific problems in emotional prosody, can be applied in adjacent domains. In particular, we show how GSP can be used for voice personalization for digital agents and avatars, and we demonstrate how the combination of GSP and STEP can be used to align impressions of robots across the auditory and visual modalities. The HITL algorithms developed in this thesis enable the creation of large-scale, high-quality datasets by leveraging human decisions to sample more directly from the associations between the stimulus space and the semantic space. In a broader context, these algorithms allow the creation of more representative corpora that can be used to train more balanced and diverse machine learning models and to benchmark the performance of state-of-the-art models.
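
The core loop of GSP, as the abstract describes it, can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the implementation from the thesis: the human rater is replaced by a hypothetical `simulated_participant`, and the three-dimensional stimulus, the `target` values, the slider resolution, and all function names are invented for the example. At each iteration one stimulus dimension is exposed as a slider while the others stay fixed, and the chosen slider value becomes the new state.

```python
import random
from typing import Callable, List

# A chooser receives the current stimulus, the dimension being varied, and the
# slider positions on offer, and returns the position it prefers.
Chooser = Callable[[List[float], int, List[float]], float]


def simulated_participant(target: List[float], noise: float,
                          rng: random.Random) -> Chooser:
    """Hypothetical stand-in for a human rater (an assumption of this sketch).

    A real participant would listen to the whole stimulus; this simulation
    only prefers slider values close to a hidden target, plus Gaussian noise.
    """
    def choose(state: List[float], dim: int, candidates: List[float]) -> float:
        def score(value: float) -> float:
            return -abs(value - target[dim]) + rng.gauss(0.0, noise)
        return max(candidates, key=score)
    return choose


def gibbs_sampling_with_people(initial: List[float], choose: Chooser,
                               n_iterations: int,
                               slider_steps: int = 21) -> List[List[float]]:
    """Update one stimulus dimension per iteration from a rater's choice."""
    candidates = [i / (slider_steps - 1) for i in range(slider_steps)]
    state = list(initial)
    chain = [list(state)]
    for iteration in range(n_iterations):
        dim = iteration % len(state)  # cycle through the dimensions
        state[dim] = choose(state, dim, candidates)
        chain.append(list(state))
    return chain


rng = random.Random(0)
target = [0.8, 0.2, 0.5]  # hypothetical prosodic settings for one emotion
chain = gibbs_sampling_with_people(
    [0.5, 0.5, 0.5], simulated_participant(target, 0.02, rng), n_iterations=30
)
print(chain[-1])
```

In an actual experiment, each call to `choose` would correspond to one participant trial, so the chain is passed from person to person rather than computed in a single process.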
Speech conveys far more than just the spoken content: it also carries information about a speaker’s intentions, personality, goals, and emotions. This additional information is transmitted through prosody, which is characterized by variations in pitch, loudness, timing, and voice quality. Emotional prosody describes, in particular, the way emotions are expressed in speech. The successful communication of emotions is a central component of human-computer and human-robot interaction, but it requires the availability of large, high-quality datasets of emotional speech recordings.
This thesis identifies three core methodological challenges in creating such corpora and proposes corresponding solutions: (i) obtaining a representative sample of all emotional prosodies (stimulus selection), (ii) identifying appropriate emotion annotations (taxonomy curation), and (iii) identifying and aligning emotional concepts across languages (“lost-in-translation” problem). This dissertation is divided into three parts. In the first part, I develop HITL algorithms that provide solutions to the identified problems. Speech corpora capture the association between prosody (stimulus space) and emotions (semantic space) only indirectly, whereas this association is actually stored in the human brain. HITL algorithms make it possible to extract these latent associations by integrating human decisions into the algorithm. To this end, I use sampling algorithms from machine learning to iteratively characterize high-dimensional probability distributions. In such procedures, humans are actively involved in generating representative and diverse samples of stimuli over the joint distribution of prosodic features and emotions. Concretely, I propose three HITL algorithms: (i) GSP, an efficient method for identifying prosodies for particular emotions using a voice model; (ii) GAP, an evolutionary algorithm for obtaining diverse emotional speech recordings; and (iii) STEP, a method for deriving an emotion taxonomy from speech corpora. While GSP and GAP address the stimulus selection problem, STEP serves to solve the taxonomy curation problem. In the second part of the thesis, I develop an infrastructure for large-scale online studies. This infrastructure makes it possible to apply the developed algorithms worldwide.
In particular, applying STEP across languages enables the study of emotional concepts in different languages, thereby addressing the “lost-in-translation” problem. The infrastructure is evaluated in a large-scale, cross-lingual experiment. In the last part of the thesis, I apply the developed algorithms to adjacent research areas. I show how GSP can be used to personalize the voices of digital agents and avatars. I also demonstrate how the combination of GSP and STEP can help align impressions of robots across modalities (auditory and visual). The HITL algorithms developed in this thesis enable the creation of large-scale, high-quality datasets by using human decisions in a targeted way to capture the associations between the stimulus space and the semantic space more efficiently. In a broader context, these methods contribute to the development of more representative corpora that can be used to train more balanced and more diverse machine learning models. They thereby not only improve the benchmarking of modern models but also make an important contribution to better capturing and using emotional prosody in technical systems.

Metadata
Author:	Pol van Rijn
URN:	urn:nbn:de:bvb:384-opus4-1232135
Frontdoor URL:	https://opus.bibliothek.uni-augsburg.de/opus4/123213
Advisor:	Elisabeth André
Type:	Doctoral Thesis
Language:	English
Date of Publication (online):	2025/09/09
Year of first Publication:	2025
Publishing Institution:	Universität Augsburg
Granting Institution:	Universität Augsburg, Fakultät für Angewandte Informatik
Date of final exam:	2025/06/23
Release Date:	2025/09/09
Tag:	emotion; emotional prosody; human-in-the-loop; prosody; semantic space
GND-Keyword:	Prosodie; Gefühl; Mensch-Maschine-Kommunikation
Page Number:	144
Institutes:	Fakultät für Angewandte Informatik
	Fakultät für Angewandte Informatik / Institut für Informatik
	Fakultät für Angewandte Informatik / Institut für Informatik / Lehrstuhl für Menschzentrierte Künstliche Intelligenz
Dewey Decimal Classification:	0 Computer science, information and general works / 00 Computer science, knowledge and systems / 004 Data processing; computer science
Licence (German):	CC-BY-SA 4.0: Creative Commons: Attribution-ShareAlike (with Print on Demand)

Open Access
