AS, CL, SD | Feb 2, 2022

Streaming Multi-Talker ASR with Token-Level Serialized Output Training

arXiv:2202.00842v5 | 78 citations
Originality: Highly original
AI Analysis

This work addresses real-time speech recognition for overlapping conversations. It offers a simpler and more efficient model for applications such as meeting transcription, though it builds incrementally on prior streaming multi-talker methods.

The paper tackles streaming multi-talker automatic speech recognition by proposing token-level serialized output training (t-SOT), which generates the tokens of multiple speakers in chronological order from a single output branch. The approach achieves state-of-the-art word error rates on the LibriSpeechMix and LibriCSS datasets.

This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi-talker ASR models, which use multiple output branches, the t-SOT model has a single output branch that generates the recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their emission times. A special token that indicates a change of "virtual" output channel is introduced to keep track of overlapping utterances. Compared to prior streaming multi-talker ASR models, the t-SOT model has a lower inference cost and a simpler architecture. Moreover, in our experiments on the LibriSpeechMix and LibriCSS datasets, the t-SOT-based transformer transducer model achieves state-of-the-art word error rates, outperforming prior results by a significant margin. For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in both accuracy and computational cost, opening the door to deploying one model for both single- and multi-talker scenarios.
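
The serialization described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes that at most two utterances overlap at any time (so two virtual channels suffice), that each token comes with an emission time, and that decoding starts on channel 0. The token string `<cc>` and the function names `serialize` and `deserialize` are hypothetical names chosen for this example.

```python
CC = "<cc>"  # hypothetical spelling of the channel-change token


def serialize(utterances):
    """Merge utterances into one token stream ordered by emission time.

    utterances: list of utterances, each a non-empty list of
    (token, emission_time) pairs in time order. Assumes at most two
    utterances overlap at any moment.
    """
    # Assign each utterance to the first free virtual channel (0 or 1).
    channel_end = [float("-inf"), float("-inf")]
    tagged = []  # (emission_time, token, channel)
    for utt in sorted(utterances, key=lambda u: u[0][1]):
        start = utt[0][1]
        free = [c for c in (0, 1) if channel_end[c] < start]
        if not free:
            raise ValueError("more than two utterances overlap")
        ch = free[0]
        channel_end[ch] = utt[-1][1]
        tagged.extend((t, tok, ch) for tok, t in utt)

    # Sort all tokens chronologically and mark every channel switch.
    tagged.sort(key=lambda item: item[0])
    stream, current = [], 0
    for _, tok, ch in tagged:
        if ch != current:
            stream.append(CC)
            current = ch
        stream.append(tok)
    return stream


def deserialize(stream):
    """Split a serialized stream back into two virtual channels."""
    channels, current = [[], []], 0
    for tok in stream:
        if tok == CC:
            current = 1 - current  # toggle between channel 0 and 1
        else:
            channels[current].append(tok)
    return channels


speaker_a = [("hello", 0.5), ("how", 1.0), ("are", 1.6), ("you", 2.0)]
speaker_b = [("i", 1.2), ("am", 1.8), ("fine", 2.4)]
stream = serialize([speaker_a, speaker_b])
print(" ".join(stream))
print(deserialize(stream))
```

Because the stream contains only ordinary tokens plus one extra symbol, a single-output-branch model can be trained on it directly, and non-overlapping speech yields a stream with no channel-change tokens at all.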

Code Implementations: 1 repo