AS SD Apr 14, 2020

Two-stage model and optimal SI-SNR for monaural multi-speaker speech separation in noisy environment

arXiv:2004.06332v2, 5 citations
AI Analysis

This paper addresses speech separation in realistic noisy conditions, an incremental advance over prior work that focused on clean laboratory settings.

The paper tackles monaural multi-speaker speech separation in noisy environments by proposing a two-stage model based on conv-TasNet and a new objective function called optimal SI-SNR, resulting in substantial performance improvements over one-stage baselines.

In daily listening environments, speech is always distorted by background noise, room reverberation and interfering speakers. With the development of deep learning approaches, much progress has been made on monaural multi-speaker speech separation. Nevertheless, most studies in this area focus on a simple laboratory problem setup in which background noise and room reverberation are not considered. In this paper, we propose a two-stage model based on conv-TasNet to deal with the notable effects of noise and interfering speakers separately, where enhancement and separation are conducted sequentially using deep dilated temporal convolutional networks (TCN). In addition, we develop a new objective function named optimal scale-invariant signal-to-noise ratio (OSI-SNR), which is better than the original SI-SNR under any circumstances. By jointly training the two-stage model with OSI-SNR, our algorithm outperforms one-stage separation baselines substantially.
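For orientation, the baseline objective that the abstract compares against can be sketched as follows. This is the commonly used definition of SI-SNR (zero-mean both signals, project the estimate onto the target, and take the energy ratio in dB), written in NumPy. The function name `si_snr`, the `eps` stabiliser and the synthetic test signals are assumptions made for illustration. The abstract does not define OSI-SNR, so the proposed variant is not reproduced here.

```python
import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Standard scale-invariant SNR in dB between two 1-D signals.

    Illustrative sketch of the usual SI-SNR definition, not the paper's
    OSI-SNR. `eps` is an assumed small constant to avoid division by zero.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)

    # Remove the mean so the measure ignores DC offsets.
    estimate = estimate - estimate.mean()
    target = target - target.mean()

    # Project the estimate onto the target to get the "signal" part.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = scale * target

    # Whatever is left after the projection counts as "noise".
    e_noise = estimate - s_target

    ratio = (np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps)
    return 10.0 * np.log10(ratio)


if __name__ == "__main__":
    # Synthetic stand-ins for a clean source and an imperfect estimate.
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)
    noise = rng.standard_normal(16000)
    estimate = clean + 0.1 * noise

    print("SI-SNR (dB):", si_snr(estimate, clean))
    # Rescaling the estimate leaves the score essentially unchanged.
    print("SI-SNR of 3x estimate (dB):", si_snr(3.0 * estimate, clean))
```

When used as a training loss, the negative of this value is typically minimised, so that a higher SI-SNR corresponds to a lower loss.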

