Combining TF-GridNet and Mixture Encoder for Continuous Speech Separation for Meeting Transcription
This work addresses speech separation for meeting transcription and offers an incremental improvement over existing methods.
The authors tackled continuous speech separation for meeting transcription by extending a mixture encoder from a static two-speaker scenario to natural meetings with an arbitrary number of speakers and varying degrees of overlap, and by integrating it with separators of varying strength, including TF-GridNet. They achieved new state-of-the-art performance on LibriCSS with a single microphone, showing that TF-GridNet largely closes the gap between previous methods and oracle separation, independently of mixture encoding.
Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A common method involves first separating the speech into overlap-free streams on which ASR is performed. Recently, TF-GridNet has shown impressive performance in speech separation in real reverberant conditions. Furthermore, a mixture encoder was proposed that leverages the mixed speech to mitigate the effect of separation artifacts. In this work, we extend the mixture encoder from a static two-speaker scenario to a natural meeting context featuring an arbitrary number of speakers and varying degrees of overlap. We further demonstrate its limits by integrating it with separators of varying strength, including TF-GridNet. Our experiments yield new state-of-the-art performance on LibriCSS using a single microphone. They show that TF-GridNet largely closes the gap between previous methods and oracle separation, independently of mixture encoding. We further investigate the remaining potential for improvement.
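
The separate-then-recognize data flow described above, with the recognizer also seeing the unseparated mixture, can be sketched roughly as follows. This is only an illustration of the data flow under stated assumptions, not the authors' implementation: the function `transcribe_streams`, all `toy_*` helpers, the two-stream output, and the additive fusion of the two encodings are hypothetical placeholders. In a real system, a trained separator (such as TF-GridNet) and learned encoders and decoder would take their place.

```python
from typing import Callable, List

Signal = List[float]
Features = List[float]


def transcribe_streams(
    mixture: Signal,
    separator: Callable[[Signal], List[Signal]],
    sep_encoder: Callable[[Signal], Features],
    mix_encoder: Callable[[Signal], Features],
    combine: Callable[[Features, Features], Features],
    decoder: Callable[[Features], str],
) -> List[str]:
    """Separate the mixture, then recognize each stream with access to the mixture."""
    streams = separator(mixture)      # overlap-free output streams
    mix_feats = mix_encoder(mixture)  # encoding of the unseparated mixture
    transcripts = []
    for stream in streams:
        sep_feats = sep_encoder(stream)  # encoding of one separated stream
        # Assumption: the two encodings are fused before decoding.
        fused = combine(sep_feats, mix_feats)
        transcripts.append(decoder(fused))
    return transcripts


# Toy placeholders so the sketch runs end to end (these are not real models).
def toy_separator(mixture: Signal) -> List[Signal]:
    # Pretend even-indexed samples belong to stream 0, odd-indexed to stream 1.
    return [mixture[0::2], mixture[1::2]]


def toy_encoder(signal: Signal) -> Features:
    # A single "feature": the mean sample value.
    return [sum(signal) / max(len(signal), 1)]


def toy_combine(a: Features, b: Features) -> Features:
    # Hypothetical fusion by element-wise addition.
    return [x + y for x, y in zip(a, b)]


def toy_decoder(feats: Features) -> str:
    # Stand-in for an ASR decoder: formats the feature as a fake hypothesis.
    return f"<hyp:{feats[0]:.2f}>"


hyps = transcribe_streams(
    [0.1, 0.2, 0.3, 0.4],
    toy_separator, toy_encoder, toy_encoder, toy_combine, toy_decoder,
)
```

Passing the components as callables keeps the sketch agnostic to the separator's strength, which mirrors the comparison across separators described above: only the `separator` argument changes, while the recognition path stays the same.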