AS CL LG SD

Sep 15, 2023

Combining TF-GridNet and Mixture Encoder for Continuous Speech Separation for Meeting Transcription

Peter Vieting, Simon Berger, Thilo von Neumann, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach

arXiv:2309.08454v22.31 citations

h-index: 25

Originality: Incremental advance

AI Analysis

This work addresses speech separation for meeting transcription and represents an incremental improvement over existing methods.

The authors tackle continuous speech separation for meeting transcription by extending a mixture encoder from a static two-speaker scenario to natural meetings with an arbitrary number of speakers and varying degrees of overlap, and by integrating it with separators of varying strength, including TF-GridNet. They report a new state of the art on LibriCSS using a single microphone and show that TF-GridNet largely closes the gap between previous methods and oracle separation, independent of mixture encoding.

Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A common method involves first separating the speech into overlap-free streams on which ASR is performed. Recently, TF-GridNet has shown impressive performance in speech separation in real reverberant conditions. Furthermore, a mixture encoder was proposed that leverages the mixed speech to mitigate the effect of separation artifacts. In this work, we extend the mixture encoder from a static two-speaker scenario to a natural meeting context featuring an arbitrary number of speakers and varying degrees of overlap. We further demonstrate its limits by integrating it with separators of varying strength, including TF-GridNet. Our experiments result in a new state-of-the-art performance on LibriCSS using a single microphone. They show that TF-GridNet largely closes the gap between previous methods and oracle separation, independent of mixture encoding. We further investigate the remaining potential for improvement.
