AS | CL | LG | SD | Sep 15, 2023

Combining TF-GridNet and Mixture Encoder for Continuous Speech Separation for Meeting Transcription

arXiv:2309.08454v2 | 1 citation | h-index: 25
Originality: Incremental advance
AI Analysis

This work addresses speech separation for meeting transcription and represents an incremental improvement over existing methods.

The authors tackle continuous speech separation for meeting transcription by extending a mixture encoder from a static two-speaker scenario to natural meetings with an arbitrary number of speakers and varying degrees of overlap, and by integrating it with separators of varying strength, including TF-GridNet. They report state-of-the-art performance on LibriCSS with a single microphone and show that TF-GridNet largely closes the gap between previous methods and oracle separation, with or without mixture encoding.

Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A common method involves first separating the speech into overlap-free streams on which ASR is performed. Recently, TF-GridNet has shown impressive performance in speech separation in real reverberant conditions. Furthermore, a mixture encoder was proposed that leverages the mixed speech to mitigate the effect of separation artifacts. In this work, we extend the mixture encoder from a static two-speaker scenario to a natural meeting context featuring an arbitrary number of speakers and varying degrees of overlap. We further demonstrate its limits by integrating it with separators of varying strength, including TF-GridNet. Our experiments yield a new state-of-the-art performance on LibriCSS using a single microphone. They show that TF-GridNet largely closes the gap between previous methods and oracle separation, independent of mixture encoding. We further investigate the remaining potential for improvement.
