Single-Channel Speech Separation with Auxiliary Speaker Embeddings
This work addresses single-channel speech separation, which is relevant to applications such as hearing aids and transcription, and improves performance on a challenging dataset. The contribution is incremental: it builds on existing methods by adding speaker embeddings.
The paper tackles single-channel speech separation by decomposing a mixed signal into two speech segments, one per speaker, using a neural network assisted by speaker embeddings learnt from clean context recordings of the two speakers. On the VoxCeleb dataset it achieves 4.79dB SDR, 8.44dB SAR, and 7.11dB SIR.
We present a novel source separation model to decompose a single-channel speech signal into two speech segments belonging to two different speakers. The proposed model is a neural network based on residual blocks, and uses learnt speaker embeddings created from additional clean context recordings of the two speakers as input to assist in attributing the different time-frequency bins to the two speakers. In experiments, we show that the proposed model yields good performance in the source separation task, and outperforms the state-of-the-art baselines. Specifically, when separating speech from the challenging VoxCeleb dataset, the proposed model yields 4.79dB signal-to-distortion ratio, 8.44dB signal-to-artifacts ratio and 7.11dB signal-to-interference ratio.
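To make the idea of attributing time-frequency bins to two speakers concrete, the following is a minimal sketch, not the paper's model. It assumes the separation is expressed as two soft masks over a magnitude spectrogram that sum to one in every bin; the abstract does not state the mask form, so this is an illustrative assumption. Random scores stand in for the output of the residual network and its speaker embeddings, and the function names `soft_masks`, `separate`, and `simple_sdr_db` are hypothetical. The energy-ratio metric shown is a simplified stand-in and is not the BSS Eval SDR, SAR, or SIR reported above.

```python
import numpy as np


def soft_masks(scores_a, scores_b):
    """Turn two per-bin score maps into masks that sum to 1 in every bin."""
    stacked = np.stack([scores_a, scores_b])  # shape: (2, freq, time)
    # Subtract the per-bin maximum for a numerically stable softmax.
    stacked = stacked - stacked.max(axis=0, keepdims=True)
    weights = np.exp(stacked)
    return weights / weights.sum(axis=0, keepdims=True)


def separate(mixture_mag, masks):
    """Apply both masks to the mixture magnitude spectrogram."""
    # Broadcasting: (2, freq, time) * (freq, time) -> (2, freq, time)
    return masks * mixture_mag


def simple_sdr_db(reference, estimate, eps=1e-12):
    """Simplified energy ratio in dB (assumption: NOT the BSS Eval SDR)."""
    residual = reference - estimate
    signal_energy = np.sum(reference ** 2) + eps
    residual_energy = np.sum(residual ** 2) + eps
    return 10.0 * np.log10(signal_energy / residual_energy)


# Placeholder data: in a real system the scores would come from a network
# conditioned on the two speaker embeddings.
rng = np.random.default_rng(0)
mixture_mag = rng.random((5, 8))        # (freq bins, time frames)
scores_a = rng.normal(size=(5, 8))
scores_b = rng.normal(size=(5, 8))

masks = soft_masks(scores_a, scores_b)
estimates = separate(mixture_mag, masks)
```

Because the two masks sum to one in each bin, the two estimated magnitudes add back up to the mixture magnitude, which is one simple way to express that every time-frequency bin is shared out between the two speakers.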