SD · LG · AS · Jun 24, 2019

Single-Channel Speech Separation with Auxiliary Speaker Embeddings

arXiv:1906.09997v1 · 4 citations
Originality: Incremental advance
AI Analysis

This work addresses speech separation for applications such as hearing aids and transcription, improving performance on a challenging dataset. It is incremental in that it builds on existing methods by adding speaker embeddings.

The paper tackles the problem of single-channel speech separation by decomposing a signal into two speakers' segments using a neural network with auxiliary speaker embeddings from clean context recordings, achieving 4.79dB SDR, 8.44dB SAR, and 7.11dB SIR on the VoxCeleb dataset.
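To give a feel for what a figure such as 4.79dB SDR measures, here is a minimal sketch of a simplified signal-to-distortion ratio. It is an assumption-laden illustration, not the evaluation code behind the reported numbers: the summary does not say which toolkit was used, and the function name `simple_sdr_db` is hypothetical. It treats everything in the estimate that differs from the reference as distortion, so it does not split the error into the interference and artifact terms that SIR and SAR report separately.

```python
import numpy as np


def simple_sdr_db(reference, estimate, eps=1e-12):
    """Simplified SDR in dB: reference energy over error energy.

    Hypothetical helper for illustration only; it does not separate
    interference from artifacts as a full SDR/SIR/SAR evaluation would.
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    error = reference - estimate
    # eps guards against division by zero and log of zero.
    return 10.0 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(error ** 2) + eps))


# Toy signal: one second of a 440 Hz tone at 16 kHz (made-up data).
t = np.arange(16000) / 16000.0
reference = np.sin(2 * np.pi * 440.0 * t)
# An estimate whose error has 1/100 of the reference energy gives about 20 dB.
estimate = 1.1 * reference
sdr = simple_sdr_db(reference, estimate)
```

Higher values mean the estimate is closer to the clean source; an error with as much energy as the reference would give roughly 0 dB under this simplified definition.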

We present a novel source separation model to decompose a single-channel speech signal into two speech segments belonging to two different speakers. The proposed model is a neural network based on residual blocks, and uses learnt speaker embeddings created from additional clean context recordings of the two speakers as input to assist in attributing the different time-frequency bins to the two speakers. In experiments, we show that the proposed model yields good performance in the source separation task, and outperforms the state-of-the-art baselines. Specifically, separating speech from the challenging VoxCeleb dataset, the proposed model yields 4.79dB signal-to-distortion ratio, 8.44dB signal-to-artifacts ratio and 7.11dB signal-to-interference ratio.
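The idea of attributing time-frequency bins to two speakers can be sketched with soft masks applied to a mixture spectrogram. This is a generic illustration under stated assumptions, not the paper's architecture: the abstract does not specify the network's output format, so the per-bin scores below are random stand-ins for whatever a trained model (conditioned on the speaker embeddings) would produce, and the names `soft_masks` and `separate` are hypothetical.

```python
import numpy as np


def soft_masks(scores_a, scores_b):
    """Turn two per-bin score maps into masks that sum to 1 in every bin."""
    stacked = np.stack([scores_a, scores_b])
    # Subtract the per-bin maximum for a numerically stable softmax.
    stacked = stacked - stacked.max(axis=0, keepdims=True)
    weights = np.exp(stacked)
    return weights / weights.sum(axis=0, keepdims=True)


def separate(mixture_mag, scores_a, scores_b):
    """Split a magnitude spectrogram between two speakers, bin by bin."""
    masks = soft_masks(scores_a, scores_b)
    return masks[0] * mixture_mag, masks[1] * mixture_mag


# Made-up data: a 257 x 100 magnitude spectrogram (frequency x time).
rng = np.random.default_rng(0)
mixture_mag = np.abs(rng.normal(size=(257, 100)))
# Assumption: random scores stand in for a trained network's outputs.
scores_a = rng.normal(size=mixture_mag.shape)
scores_b = rng.normal(size=mixture_mag.shape)

est_a, est_b = separate(mixture_mag, scores_a, scores_b)
```

Because the two masks sum to one in each bin, the two estimates add back up to the mixture magnitude. A real system would still need to pair each estimate with a phase (often the mixture's) and invert the transform to obtain waveforms.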

