AS · CL · SD · Jun 9, 2025

Speaker-Distinguishable CTC: Learning Speaker Distinction Using CTC for Multi-Talker Speech Recognition

arXiv:2506.07515v1 · 4 citations · h-index: 20 · INTERSPEECH
Originality: Incremental advance
AI Analysis

This work addresses multi-talker speech recognition without auxiliary information, offering a practical improvement for applications like transcription of overlapping conversations, though it is incremental as it builds on existing SOT and CTC methods.

The paper tackles speaker assignment failures in multi-talker speech recognition by proposing Speaker-Distinguishable CTC (SD-CTC), which jointly assigns a token and a speaker label to each frame, and by integrating it into the SOT framework. Experimental results show a 26% reduction in the SOT model's error rate and performance comparable to state-of-the-art methods that rely on auxiliary information.

This paper presents a novel framework for multi-talker automatic speech recognition without the need for auxiliary information. Serialized Output Training (SOT), a widely used approach, suffers from recognition errors due to speaker assignment failures. Although incorporating auxiliary information, such as token-level timestamps, can improve recognition accuracy, extracting such information from natural conversational speech remains challenging. To address this limitation, we propose Speaker-Distinguishable CTC (SD-CTC), an extension of CTC that jointly assigns a token and its corresponding speaker label to each frame. We further integrate SD-CTC into the SOT framework, enabling the SOT model to learn speaker distinction using only overlapping speech and transcriptions. Experimental comparisons show that multi-task learning with SD-CTC and SOT reduces the error rate of the SOT model by 26% and achieves performance comparable to state-of-the-art methods relying on auxiliary information.
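The idea of assigning a token and a speaker label jointly to each frame can be pictured with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the label layout (a shared blank at index 0, followed by one block of token ids per speaker), the helper names `joint_id`, `split_id`, and `greedy_decode`, and the greedy decoding rule. It shows only how frame-level joint labels could be turned into per-speaker transcripts, not the SD-CTC training loss.

```python
from typing import Dict, List, Sequence, Tuple

BLANK = 0  # assumption: index 0 is a blank shared by all speakers


def joint_id(token: int, speaker: int, vocab_size: int) -> int:
    """Map a (token, speaker) pair to one joint label id greater than 0."""
    return 1 + speaker * vocab_size + token


def split_id(label: int, vocab_size: int) -> Tuple[int, int]:
    """Inverse of joint_id: recover (token, speaker) from a non-blank label."""
    return (label - 1) % vocab_size, (label - 1) // vocab_size


def greedy_decode(
    frame_scores: Sequence[Sequence[float]],
    vocab_size: int,
    num_speakers: int,
) -> Dict[int, List[int]]:
    """Pick the best joint label per frame, apply the usual CTC collapse
    (merge repeated labels, drop blanks), and route each surviving token
    to the transcript of the speaker encoded in its label."""
    transcripts: Dict[int, List[int]] = {s: [] for s in range(num_speakers)}
    prev = BLANK
    for scores in frame_scores:
        best = max(range(len(scores)), key=scores.__getitem__)
        if best != BLANK and best != prev:
            token, speaker = split_id(best, vocab_size)
            transcripts[speaker].append(token)
        prev = best
    return transcripts


def one_hot(index: int, size: int) -> List[float]:
    """Toy frame posterior that puts all mass on one joint label."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Toy setup: 2 tokens, 2 speakers, so 1 + 2 * 2 = 5 joint labels per frame.
VOCAB, SPEAKERS = 2, 2
frame_labels = [1, 1, 3, 0, 2, 4, 4, 3]  # hypothetical best label per frame
frames = [one_hot(label, 1 + VOCAB * SPEAKERS) for label in frame_labels]
result = greedy_decode(frames, VOCAB, SPEAKERS)
print(result)
```

In this toy run, the frames interleave labels from both speakers, and the decoder separates them into one token sequence per speaker without any timestamps. A real system would use learned frame posteriors in place of the one-hot frames.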
