End-to-end training of time domain audio separation and recognition
This work addresses the challenge of recognizing overlapping speech from a single-channel recording. The contribution is incremental: it integrates time domain separation with end-to-end recognition to improve recognition performance.
The paper tackles single-channel multi-speaker speech separation and recognition by combining a Conv-TasNet separation module with an end-to-end speech recognizer. It reports a word error rate of 11.0% on WSJ0-2mix, which indicates substantial improvements over previously proposed cascade DNN-HMM and monolithic E2E frequency domain systems.
The rising interest in single-channel multi-speaker speech separation has sparked the development of End-to-End (E2E) approaches to multi-speaker speech recognition. However, state-of-the-art neural network-based time domain source separation has not yet been combined with E2E speech recognition. Here we demonstrate how to combine a separation module based on a Convolutional Time domain Audio Separation Network (Conv-TasNet) with an E2E speech recognizer, and how to train such a model jointly, either by distributing it over multiple GPUs or by approximating truncated back-propagation for the convolutional front-end. To put this work into perspective and illustrate the complexity of the design space, we provide a compact overview of single-channel multi-speaker recognition systems. Our experiments show a word error rate of 11.0% on WSJ0-2mix and indicate that our joint time domain model can yield substantial improvements over the cascade DNN-HMM and monolithic E2E frequency domain systems proposed so far.
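For readers unfamiliar with the reported metric, the following is a minimal sketch of how a word error rate could be computed for a two-speaker mixture. It is not the scoring procedure used in this work: the function names `edit_distance` and `multi_speaker_wer` are hypothetical, and choosing the assignment of system outputs to reference transcripts that minimizes the total error count is an assumption made here for illustration.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def multi_speaker_wer(refs, hyps):
    """WER over all speakers, using the output-to-reference assignment
    with the fewest total errors (an assumption of this sketch)."""
    ref_tokens = [r.split() for r in refs]
    hyp_tokens = [h.split() for h in hyps]
    total_words = sum(len(r) for r in ref_tokens)
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(ref_tokens, perm))
        for perm in permutations(hyp_tokens)
    )
    return best_errors / total_words


# Toy example with invented transcripts; the outputs arrive in swapped order.
refs = ["the cat sat", "a dog barked loudly"]
hyps = ["a dog barked", "the cat sat"]
wer = multi_speaker_wer(refs, hyps)
```

In this toy case the best assignment leaves a single deleted word out of seven reference words. The search over permutations is cheap for two speakers but grows factorially with the number of speakers.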