AS SD Apr 14, 2020

Two-stage model and optimal SI-SNR for monaural multi-speaker speech separation in noisy environment

arXiv:2004.06332v23.35 citations

Originality: Incremental advance

AI Analysis

This paper addresses speech separation in realistic noisy conditions, an incremental advance over prior work focused on clean laboratory settings.

The paper tackles monaural multi-speaker speech separation in noisy environments by proposing a two-stage model based on conv-TasNet and a new objective function called optimal SI-SNR, resulting in substantial performance improvements over one-stage baselines.

In everyday listening environments, speech is always distorted by background noise, room reverberation, and interfering speakers. With the development of deep learning approaches, much progress has been made on monaural multi-speaker speech separation. Nevertheless, most studies in this area focus on a simple laboratory problem setup in which background noise and room reverberation are not considered. In this paper, we propose a two-stage model based on conv-TasNet that deals with the effects of noise and interfering speakers separately: enhancement and separation are conducted sequentially using deep dilated temporal convolutional networks (TCN). In addition, we develop a new objective function named optimal scale-invariant signal-to-noise ratio (OSI-SNR), which is better than the original SI-SNR under all circumstances. By jointly training the two-stage model with OSI-SNR, our algorithm substantially outperforms one-stage separation baselines.
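For readers unfamiliar with the baseline metric, the sketch below computes the standard SI-SNR that the abstract compares against. It is not the paper's OSI-SNR, whose definition is not given here. The function name `si_snr`, the `eps` stabilizer, and the zero-mean normalization step are illustrative assumptions that follow common practice, not details taken from the paper.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Standard scale-invariant SNR in dB (NOT the paper's OSI-SNR).

    `estimate` and `target` are equal-length sequences of floats.
    `eps` is an assumed small constant that guards against division by zero.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)

    # Remove the mean of each signal (a common convention, assumed here).
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Project the estimate onto the target: s_target = (<e, t> / ||t||^2) * t
    dot = sum(a * b for a, b in zip(e, t))
    target_energy = sum(b * b for b in t) + eps
    scale = dot / target_energy
    s_target = [scale * b for b in t]

    # Everything in the estimate not explained by the target counts as noise.
    e_noise = [a - p for a, p in zip(e, s_target)]

    signal_power = sum(p * p for p in s_target)
    noise_power = sum(q * q for q in e_noise) + eps
    return 10.0 * math.log10(signal_power / noise_power + eps)


# Toy signals: a sinusoid as the clean target plus a small interfering tone.
target = [math.sin(0.1 * i) for i in range(200)]
noise = [0.1 * math.cos(0.37 * i) for i in range(200)]
estimate = [s + v for s, v in zip(target, noise)]

score = si_snr(estimate, target)
score_rescaled = si_snr([3.0 * x for x in estimate], target)
score_noisier = si_snr([s + 5.0 * v for s, v in zip(target, noise)], target)
```

In this toy example, `score` and `score_rescaled` agree up to the effect of `eps`, which is the scale invariance the metric is named for, while `score_noisier` is lower because more of the estimate lies outside the target direction. In training, the negative of this quantity is typically used as the loss.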

