ASSD Oct 28, 2021

Continuous Speech Separation with Recurrent Selective Attention Network

arXiv:2110.14838v1, 2 citations
Originality: Incremental advance
AI Analysis

This work addresses speech separation issues in conversation transcription, offering incremental improvements for applications like automatic speech recognition.

The paper tackles speech leakage and separation failures in continuous speech separation by proposing a recurrent selective attention network with block-wise dependency, which improves speech recognition accuracy on the LibriCSS dataset over models based on permutation invariant training.

While permutation invariant training (PIT) based continuous speech separation (CSS) significantly improves the conversation transcription accuracy, it often suffers from speech leakages and failures in separation at "hot spot" regions because it has a fixed number of output channels. In this paper, we propose to apply recurrent selective attention network (RSAN) to CSS, which generates a variable number of output channels based on active speaker counting. In addition, we propose a novel block-wise dependency extension of RSAN by introducing dependencies between adjacent processing blocks in the CSS framework. It enables the network to utilize the separation results from the previous blocks to facilitate the current block processing. Experimental results on the LibriCSS dataset show that the RSAN-based CSS (RSAN-CSS) network consistently improves the speech recognition accuracy over PIT-based models. The proposed block-wise dependency modeling further boosts the performance of RSAN-CSS.
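
The abstract does not spell out how a variable number of output channels is produced or how adjacent blocks are linked, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a residual-mask recursion: one source is extracted per iteration, its mask is subtracted from a residual mask, and extraction stops once little unexplained energy remains. It also assumes block-wise dependency can be modelled by handing the speakers found in the previous block to the next block as hints. The names `extract_sources`, `make_toy_estimator`, `stop_threshold`, and `previous_speakers` are hypothetical, and the toy estimator stands in for the neural network by reading ground-truth bin ownership.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Mask = List[float]  # one value in [0, 1] per (flattened) time-frequency bin
Estimator = Callable[[Mask, Optional[int]], Tuple[Mask, int]]


def extract_sources(
    estimate_mask: Estimator,
    num_bins: int,
    previous_speakers: Sequence[int] = (),
    stop_threshold: float = 0.05,   # assumed stopping criterion
    max_iterations: int = 8,        # safety bound so the loop always ends
) -> List[Tuple[int, Mask]]:
    """Extract one source per iteration until the residual mask is nearly empty."""
    residual = [1.0] * num_bins     # everything is unexplained at the start
    hints = list(previous_speakers)
    outputs: List[Tuple[int, Mask]] = []
    for _ in range(max_iterations):
        if sum(residual) / num_bins < stop_threshold:
            break                   # nothing left to explain: stop counting
        hint = hints.pop(0) if hints else None
        mask, speaker = estimate_mask(residual, hint)
        if sum(mask) > 0:           # skip speakers that are silent in this block
            outputs.append((speaker, mask))
        residual = [max(r - m, 0.0) for r, m in zip(residual, mask)]
    return outputs


def make_toy_estimator(owners: Sequence[int]) -> Estimator:
    """Stand-in for the network: 'owners' gives the true speaker id of each bin."""
    def estimate(residual: Mask, hint: Optional[int]) -> Tuple[Mask, int]:
        if hint is not None:
            speaker = hint          # follow the speaker handed over from the last block
        else:
            remaining = {}
            for owner, r in zip(owners, residual):
                if r > 0:
                    remaining[owner] = remaining.get(owner, 0) + 1
            # pick the speaker with the most unexplained bins (ties: smallest id)
            speaker = max(sorted(remaining), key=remaining.get)
        mask = [1.0 if o == speaker and r > 0 else 0.0
                for o, r in zip(owners, residual)]
        return mask, speaker
    return estimate


# Block 1: speakers 0 and 1 overlap. Block 2: speaker 1 continues, speaker 2 joins.
block1 = [0, 0, 0, 1, 1, 0]
block2 = [2, 2, 2, 2, 1, 1]

out1 = extract_sources(make_toy_estimator(block1), len(block1))
carried = [speaker for speaker, _ in out1]
out2 = extract_sources(make_toy_estimator(block2), len(block2),
                       previous_speakers=carried)

print([speaker for speaker, _ in out1])  # [0, 1]
print([speaker for speaker, _ in out2])  # [1, 2]
```

In this toy run the number of output channels follows the number of active speakers in each block, and the hints from block 1 make the continuing speaker come out before the new one in block 2. Without the hints, block 2 would yield the dominant new speaker first, which is the kind of cross-block inconsistency the dependency is meant to reduce.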

