Single-channel speech separation with auxiliary speaker embeddings
- We present a novel source separation model to decompose a single-channel speech signal into two speech segments belonging to two different speakers. The proposed model is a neural network based on residual blocks, and uses learnt speaker embeddings created from additional clean context recordings of the two speakers as input to assist in attributing the different time-frequency bins to the two speakers. In experiments, we show that the proposed model yields good performance in the source separation task, and outperforms the state-of-the-art baselines. Specifically, separating speech from the challenging VoxCeleb dataset, the proposed model yields 4.79 dB signal-to-distortion ratio, 8.44 dB signal-to-artifacts ratio, and 7.11 dB signal-to-interference ratio.
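The attribution of time-frequency bins to speakers described above follows the common soft-masking paradigm. The sketch below illustrates that paradigm only: the shapes, the random linear scorer standing in for the residual network, and all variable names are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: T time frames, F frequency bins, EMB embedding size
# (illustrative values, not from the paper).
T, F, EMB = 10, 8, 4

# Mixture magnitude spectrogram, plus one embedding per speaker as if
# extracted from their clean context recordings (random placeholders).
mixture = np.abs(rng.standard_normal((T, F)))
emb_a = rng.standard_normal(EMB)
emb_b = rng.standard_normal(EMB)

# Stand-in for the residual network: a random linear map that scores
# how strongly each frequency bin matches a given speaker embedding.
W = rng.standard_normal((F, EMB))

def speaker_scores(emb):
    # Broadcast a per-frequency affinity score across all time frames.
    return mixture * (W @ emb)

scores = np.stack([speaker_scores(emb_a), speaker_scores(emb_b)])

# Softmax over the speaker axis yields two soft masks that sum to one
# at every bin, attributing each bin between the two speakers.
exp = np.exp(scores - scores.max(axis=0))
masks = exp / exp.sum(axis=0)

# The masked spectrograms are the separated estimates; because the
# masks partition each bin, they reconstruct the mixture exactly.
source_a, source_b = masks[0] * mixture, masks[1] * mixture
print(np.allclose(source_a + source_b, mixture))  # True
```

In a trained system the scorer would be the learnt residual network and the masks would be applied to the complex STFT before inversion; the partition property of the softmax masks is what makes the bin-attribution view well defined.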
| Author: | Shuo Liu, Gil Keren, Björn Schuller |
|---|---|
| URN: | urn:nbn:de:bvb:384-opus4-717556 |
| Frontdoor URL | https://opus.bibliothek.uni-augsburg.de/opus4/71755 |
| Parent Title (English): | arXiv |
| Type: | Preprint |
| Language: | English |
| Date of Publication (online): | 2020/03/03 |
| Year of first Publication: | 2019 |
| Publishing Institution: | Universität Augsburg |
| Release Date: | 2020/03/03 |
| First Page: | arXiv:1906.09997 |
| DOI: | https://doi.org/10.48550/arXiv.1906.09997 |
| Institutes: | Fakultät für Angewandte Informatik |
| | Fakultät für Angewandte Informatik / Institut für Informatik |
| | Fakultät für Angewandte Informatik / Institut für Informatik / Lehrstuhl für Embedded Intelligence for Health Care and Wellbeing |
| Dewey Decimal Classification: | 0 Computer science, information, general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science |
| Licence (German): | Deutsches Urheberrecht |



