Coupling sentiment and arousal analysis towards an affective dialogue manager
- We present the technologies and host components developed to power a speech-based dialogue manager with affective capabilities. The overall goal is that the system adapts its response to the sentiment and arousal level of the user inferred by analysing the linguistic and paralinguistic information embedded in his or her interaction. A linguistic-based, dedicated sentiment analysis component determines the body of the system response. A paralinguistic-based, dedicated arousal recognition component adjusts the energy level to convey in the affective system response. The sentiment analysis model is trained using the CMU-MOSEI dataset and implements a hierarchical contextual attention fusion network, which scores an Unweighted Average Recall (UAR) of 79.04% on the test set when tackling the task as a binary classification problem. The arousal recognition model is trained using the MSP-Podcast corpus. This model extracts the Mel-spectrogram representations of the speech signals, which areWe present the technologies and host components developed to power a speech-based dialogue manager with affective capabilities. The overall goal is that the system adapts its response to the sentiment and arousal level of the user inferred by analysing the linguistic and paralinguistic information embedded in his or her interaction. A linguistic-based, dedicated sentiment analysis component determines the body of the system response. A paralinguistic-based, dedicated arousal recognition component adjusts the energy level to convey in the affective system response. The sentiment analysis model is trained using the CMU-MOSEI dataset and implements a hierarchical contextual attention fusion network, which scores an Unweighted Average Recall (UAR) of 79.04% on the test set when tackling the task as a binary classification problem. The arousal recognition model is trained using the MSP-Podcast corpus. This model extracts the Mel-spectrogram representations of the speech signals, which are exploited with a Convolutional Neural Network (CNN) trained from scratch, and scores a UAR of 61.11% on the test set when tackling the task as a three-class classification problem. Furthermore, we highlight two sample dialogues implemented at the system back-end to detail how the sentiment and arousal inferences are coupled to determine the affective system response. These are also showcased in a proof of concept demonstrator. We publicly release the trained models to provide the research community with off-the-shelf sentiment analysis and arousal recognition tools.…