AS SD Sep 7, 2020

An End-to-end Architecture of Online Multi-channel Speech Separation

Jian Wu, Zhuo Chen, Jinyu Li, Takuya Yoshioka, Zhili Tan, Ed Lin, Yi Luo, Lei Xie

arXiv:2009.03141v1 10.321 citations

Originality: Incremental advance

AI Analysis

This work addresses speech overlap in conversation transcription, which is a key challenge for improving multi-speaker recognition systems, though it is incremental as it builds on prior UFE work.

The paper tackles the problem of multi-speaker speech recognition by introducing an end-to-end version of the UFE system for speech separation, achieving comparable offline performance and remarkable improvements in online evaluation.

Multi-speaker speech recognition has been one of the key challenges in conversation transcription, as it breaks the single active speaker assumption employed by most state-of-the-art speech recognition systems. Speech separation is considered a remedy to this problem. Previously, we introduced a system called unmixing, fixed-beamformer and extraction (UFE), which was shown to be effective in addressing the speech overlap problem in conversation transcription. With UFE, an input mixed signal is processed by fixed beamformers, followed by neural network post-filtering. Although promising results were obtained, the system contains multiple individually developed modules, potentially leading to sub-optimum performance. In this work, we introduce an end-to-end modeling version of UFE. To enable gradient propagation all the way, an attentional selection module is proposed, where an attentional weight is learnt for each beamformer and spatial feature sampled over space. Experimental results show that the proposed system achieves comparable performance in an offline evaluation with the original separate processing-based pipeline, while producing remarkable improvements in an online evaluation.
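The attentional selection idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it only shows the general technique of replacing a hard (argmax) choice among fixed beamformer outputs with a softmax-weighted sum, which is what lets gradients reach every candidate. The function name `attentional_select`, the array shapes, the beam count of 18, and the use of real-valued features are all assumptions made for illustration; in a real system the scores would come from a trained network.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attentional_select(beams, scores):
    """Soft selection over fixed beamformer outputs (hypothetical helper).

    beams:  array of shape (B, T, F), the outputs of B fixed beamformers
            over T frames and F frequency bins.
    scores: array of shape (B,), unnormalized relevance scores, assumed
            here to be produced elsewhere by a neural network.
    Returns the weighted combination of shape (T, F) and the weights.
    """
    weights = softmax(scores)  # shape (B,), non-negative, sums to 1
    # Weighted sum over the beam axis; every beam contributes, so the
    # operation stays differentiable with respect to the scores.
    selected = np.tensordot(weights, beams, axes=([0], [0]))
    return selected, weights

# Illustrative shapes only: 18 beams, 100 frames, 257 frequency bins.
rng = np.random.default_rng(0)
beams = rng.standard_normal((18, 100, 257))
scores = rng.standard_normal(18)
selected, weights = attentional_select(beams, scores)
```

When one score dominates the others, the weights approach a one-hot vector and the result approaches the hard selection of a single beam, so the soft version can be read as a differentiable relaxation of picking the best beamformer.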
