SD AS · Aug 31, 2017

Joint Separation and Denoising of Noisy Multi-talker Speech using Recurrent Neural Networks and Permutation Invariant Training

Morten Kolbæk, Dong Yu, Zheng-Hua Tan, Jesper Jensen

arXiv:1708.09588v1 · 11.822 citations

Originality: Incremental advance

AI Analysis

This work addresses speech separation and denoising for applications such as hearing aids and voice assistants. It is an incremental advance because it builds on existing methods, namely uPIT and LSTM RNNs.

The paper tackles the problem of separating and denoising noisy multi-talker speech using recurrent neural networks with permutation invariant training, achieving improvements in Signal-to-Distortion Ratio and Extended Short-Time Objective Intelligibility across various noise types and speaker counts.

In this paper we propose to use utterance-level Permutation Invariant Training (uPIT) for speaker independent multi-talker speech separation and denoising, simultaneously. Specifically, we train deep bi-directional Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) using uPIT, for single-channel speaker independent multi-talker speech separation in multiple noisy conditions, including both synthetic and real-life noise signals. We focus our experiments on generalizability and noise robustness of models that rely on various types of a priori knowledge e.g. in terms of noise type and number of simultaneous speakers. We show that deep bi-directional LSTM RNNs trained using uPIT in noisy environments can improve the Signal-to-Distortion Ratio (SDR) as well as the Extended Short-Time Objective Intelligibility (ESTOI) measure, on the speaker independent multi-talker speech separation and denoising task, for various noise types and Signal-to-Noise Ratios (SNRs). Specifically, we first show that LSTM RNNs can achieve large SDR and ESTOI improvements, when evaluated using known noise types, and that a single model is capable of handling multiple noise types with only a slight decrease in performance. Furthermore, we show that a single LSTM RNN can handle both two-speaker and three-speaker noisy mixtures, without a priori knowledge about the exact number of speakers. Finally, we show that LSTM RNNs trained using uPIT generalize well to noise types not seen during training.
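As a rough illustration of the utterance-level permutation invariant criterion named in the abstract, the sketch below scores every assignment of output streams to target speakers over a whole utterance and keeps the cheapest one. The function name `upit_mse`, the `(speakers, frames, bins)` array layout, and the plain mean-squared error are assumptions made for this example; the paper's actual features, network, and training objective are not reproduced here.

```python
from itertools import permutations

import numpy as np


def upit_mse(estimates, targets):
    """Utterance-level permutation invariant MSE (illustrative sketch).

    estimates, targets: arrays of shape (speakers, frames, bins).
    Returns (best_loss, best_perm), where best_perm[s] is the index of the
    target speaker assigned to output stream s for the entire utterance.
    """
    estimates = np.asarray(estimates, dtype=float)
    targets = np.asarray(targets, dtype=float)
    num_speakers = targets.shape[0]

    best_loss, best_perm = np.inf, None
    for perm in permutations(range(num_speakers)):
        # Reorder the targets once for the whole utterance, not per frame.
        loss = np.mean((estimates - targets[list(perm)]) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    targets = rng.normal(size=(3, 5, 4))
    # Pretend the network emitted the speakers in a shuffled order.
    estimates = targets[[2, 0, 1]]
    print(upit_mse(estimates, targets))
```

Because one permutation is chosen per utterance rather than per frame, each output stream stays tied to a single speaker throughout the utterance. The exhaustive search grows factorially with the number of speakers, which is manageable for the two-speaker and three-speaker mixtures discussed above.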
