Open Access

Refine

Has Fulltext

  • yes (25)
  • no (9)

Author

  • Liu, Shuo (34)
  • Schuller, Björn W. (18)
  • Mallol-Ragolta, Adria (11)
  • Parada-Cabaleiro, Emilia (10)
  • Schuller, Björn (10)
  • Yang, Zijiang (9)
  • Song, Meishu (8)
  • Triantafyllopoulos, Andreas (8)
  • Cummins, Nicholas (6)
  • Qian, Kun (6)

Year of publication

  • 2024 (2)
  • 2023 (6)
  • 2022 (8)
  • 2021 (10)
  • 2020 (5)
  • 2019 (3)

Document Type

  • Article (15)
  • Part of a Book (12)
  • Conference Proceeding (4)
  • Preprint (2)
  • Doctoral Thesis (1)

Language

  • English (34)

Keywords

  • Computer Science Applications (4)
  • Software (3)
  • Computer Networks and Communications (2)
  • Electrical and Electronic Engineering (2)
  • Hardware and Architecture (2)
  • Health Informatics (2)
  • Signal Processing (2)
  • Applied Mathematics (1)
  • Artificial Intelligence (1)
  • Computational Theory and Mathematics (1)

Institute

  • Fakultät für Angewandte Informatik (34)
  • Institut für Informatik (34)
  • Lehrstuhl für Embedded Intelligence for Health Care and Wellbeing (34)
  • Sustainability Goals (10)
  • Goal 3 - Good Health and Well-Being (10)
  • Lehrstuhl für Menschzentrierte Künstliche Intelligenz (1)

34 search hits

AVEC 2019 workshop and challenge: state-of-mind, detecting depression with AI, and cross-cultural affect recognition
Single-channel speech separation with auxiliary speaker embeddings
We present a novel source separation model to decompose a single-channel speech signal into two speech segments belonging to two different speakers. The proposed model is a neural network based on residual blocks, and uses learnt speaker embeddings created from additional clean context recordings of the two speakers as input to assist in attributing the different time-frequency bins to the two speakers. In experiments, we show that the proposed model yields good performance in the source separation task, and outperforms the state-of-the-art baselines. Specifically, separating speech from the challenging VoxCeleb dataset, the proposed model yields 4.79 dB signal-to-distortion ratio, 8.44 dB signal-to-artifacts ratio and 7.11 dB signal-to-interference ratio.
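
As a rough illustration of the idea in this abstract: a mask-estimation network conditioned on speaker embeddings derived from clean context recordings of the two target speakers. The sketch below is a minimal assumption-laden example, not the paper's model; the simple feed-forward encoder, all layer sizes, and the name ConditionedSeparator are illustrative choices (the paper uses residual blocks).

```python
# Sketch: speaker-embedding-conditioned mask estimation for single-channel
# separation. All names and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn


class ConditionedSeparator(nn.Module):
    def __init__(self, n_freq: int = 257, emb_dim: int = 128, hidden: int = 256):
        super().__init__()
        # Encode each mixture frame together with both speaker embeddings.
        self.encoder = nn.Sequential(
            nn.Linear(n_freq + 2 * emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
        )
        # Predict one time-frequency mask per target speaker.
        self.mask_head = nn.Linear(hidden, 2 * n_freq)

    def forward(self, mixture: torch.Tensor,
                emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
        # mixture: (batch, frames, n_freq) magnitude spectrogram
        # emb_a, emb_b: (batch, emb_dim) embeddings from clean context recordings
        frames = mixture.size(1)
        cond = torch.cat([emb_a, emb_b], dim=-1)            # (batch, 2*emb_dim)
        cond = cond.unsqueeze(1).expand(-1, frames, -1)     # broadcast over time
        h = self.encoder(torch.cat([mixture, cond], dim=-1))
        masks = torch.sigmoid(self.mask_head(h))            # (batch, frames, 2*n_freq)
        masks = masks.view(mixture.size(0), frames, 2, -1)
        # Each speaker's estimate is the mixture weighted by its mask.
        return masks * mixture.unsqueeze(2)
```

In a masking-based pipeline of this kind, the two masked magnitude spectrograms would typically be combined with the mixture phase and inverted (e.g. via an inverse STFT) to obtain the two separated waveforms.
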
N-HANS: a neural network-based toolkit for in-the-wild audio enhancement
An early study on intelligent analysis of speech under COVID-19: severity, sleep quality, fatigue, and anxiety
Towards speech robustness for acoustic scene classification
Adventitious respiratory classification using attentive residual neural networks
COVID-19 detection with a novel multi-type deep fusion method using breathing and coughing information
The filtering effect of face masks in their detection from speech
Multistage linguistic conditioning of convolutional layers for speech emotion recognition
Introduction: The effective fusion of text and audio information for categorical and dimensional speech emotion recognition (SER) remains an open issue, especially given the vast potential of deep neural networks (DNNs) to provide a tighter integration of the two. Methods: In this contribution, we investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional SER. We propose a novel, multistage fusion method where the two information streams are integrated in several layers of a DNN, and contrast it with a single-stage one where the streams are merged in a single point. Both methods depend on extracting summary linguistic embeddings from a pre-trained BERT model, and conditioning one or more intermediate representations of a convolutional model operating on log-Mel spectrograms. Results: Experiments on the MSP-Podcast and IEMOCAP datasets demonstrate that the two fusion methods clearly outperform a shallow (late) fusion baseline and their unimodal constituents, both in terms of quantitative performance and qualitative behavior. Discussion: Overall, our multistage fusion shows better quantitative performance, surpassing alternatives on most of our evaluations. This illustrates the potential of multistage fusion in better assimilating text and audio information.
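
As a rough illustration of the multistage conditioning idea described above, the sketch below injects a pre-trained BERT sentence embedding into several convolutional stages that operate on log-Mel spectrograms. The FiLM-style scale-and-shift modulation, the layer sizes, and the names ConditionedConvBlock and MultistageFusionSER are assumptions made here for illustration only, not the paper's architecture.

```python
# Sketch: multistage linguistic conditioning of a convolutional audio model.
# The FiLM-style modulation and all names/sizes are illustrative assumptions.
import torch
import torch.nn as nn


class ConditionedConvBlock(nn.Module):
    """Conv block whose activations are modulated by a text embedding."""

    def __init__(self, in_ch: int, out_ch: int, text_dim: int = 768):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm2d(out_ch)
        # Map the BERT sentence embedding to per-channel scale and shift.
        self.to_scale_shift = nn.Linear(text_dim, 2 * out_ch)

    def forward(self, x: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        h = self.norm(self.conv(x))
        scale, shift = self.to_scale_shift(text_emb).chunk(2, dim=-1)
        h = h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        return torch.relu(h)


class MultistageFusionSER(nn.Module):
    """Injects the linguistic embedding at several convolutional stages."""

    def __init__(self, n_classes: int = 4, text_dim: int = 768):
        super().__init__()
        self.blocks = nn.ModuleList([
            ConditionedConvBlock(1, 32, text_dim),
            ConditionedConvBlock(32, 64, text_dim),
            ConditionedConvBlock(64, 128, text_dim),
        ])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, log_mel: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # log_mel: (batch, 1, mel_bins, frames); text_emb: (batch, text_dim)
        h = log_mel
        for block in self.blocks:   # multistage: condition every block
            h = block(h, text_emb)
        return self.classifier(self.pool(h).flatten(1))
```

In this sketch, a single-stage variant would instead merge the text embedding at one point only (for example, by concatenating it with the pooled audio representation before the classifier), whereas the multistage version injects the linguistic information into every convolutional block.
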
Supervised contrastive learning for game-play frustration detection from speech
Deep speaker conditioning for speech emotion recognition
A novel attention-based gated recurrent unit and its efficacy in speech emotion recognition
Computer audition for fighting the SARS-CoV-2 corona crisis: introducing the multitask speech corpus for COVID-19
Predicting group work performance from physical handwriting features in a smart English classroom
Frustration recognition from speech during game interaction using wide residual networks
Coughing-based recognition of COVID-19 with spatial attentive ConvLSTM recurrent neural networks
Hierarchical component-attention based speaker turn embedding for emotion recognition
A review of automatic recognition technology for bird vocalizations in the deep learning era
The utility of wearable devices in assessing ambulatory impairments of people with multiple sclerosis in free-living conditions