AS CL SD · Jun 9, 2025

Speaker-Distinguishable CTC: Learning Speaker Distinction Using CTC for Multi-Talker Speech Recognition

Asahi Sakuma, Hiroaki Sato, Ryuga Sugano, Tadashi Kumano, Yoshihiko Kawai, Tetsuji Ogawa

arXiv:2506.07515v15.15 citations · h-index: 20 · INTERSPEECH

Originality: Incremental advance

AI Analysis

This work addresses multi-talker speech recognition without auxiliary information, a practical improvement for applications such as transcribing overlapping conversations. The advance is incremental, since it builds on the existing SOT and CTC methods.

The paper tackles speaker assignment failures in multi-talker speech recognition. It proposes Speaker-Distinguishable CTC (SD-CTC), which jointly assigns a token and a speaker label to each frame, and integrates it into the SOT framework. In the reported experiments, multi-task learning with SD-CTC reduces the SOT model's error rate by 26% and achieves performance comparable to state-of-the-art methods that rely on auxiliary information.

This paper presents a novel framework for multi-talker automatic speech recognition without the need for auxiliary information. Serialized Output Training (SOT), a widely used approach, suffers from recognition errors due to speaker assignment failures. Although incorporating auxiliary information, such as token-level timestamps, can improve recognition accuracy, extracting such information from natural conversational speech remains challenging. To address this limitation, we propose Speaker-Distinguishable CTC (SD-CTC), an extension of CTC that jointly assigns a token and its corresponding speaker label to each frame. We further integrate SD-CTC into the SOT framework, enabling the SOT model to learn speaker distinction using only overlapping speech and transcriptions. Experimental comparisons show that multi-task learning with SD-CTC and SOT reduces the error rate of the SOT model by 26% and achieves performance comparable to state-of-the-art methods relying on auxiliary information.
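As a rough illustration of what it means to assign a token and a speaker label jointly to each frame, the sketch below collapses a frame-level path of (token, speaker) pairs into per-speaker transcripts and then serializes them in SOT style. It is an assumption-laden toy, not the paper's formulation: the shared blank symbol, the tuple encoding of labels, the `<sc>` speaker-change token name, and the function names `collapse_joint_path` and `serialize_sot` are all hypothetical choices made for this example.

```python
from itertools import groupby

# Assumption: a single blank shared by all speakers (no token, no speaker).
BLANK = ("<blank>", None)


def collapse_joint_path(frames):
    """CTC-style collapse of a frame-level path of (token, speaker) pairs.

    Consecutive identical pairs are merged, then blanks are dropped.
    Returns a dict mapping speaker id -> list of tokens in frame order.
    """
    merged = [pair for pair, _ in groupby(frames)]
    transcripts = {}
    for token, speaker in merged:
        if (token, speaker) == BLANK:
            continue
        transcripts.setdefault(speaker, []).append(token)
    return transcripts


def serialize_sot(transcripts, order, sc="<sc>"):
    """Concatenate per-speaker transcripts, separated by a speaker-change token.

    `order` lists speaker ids in the order they should appear in the output.
    """
    out = []
    for i, speaker in enumerate(order):
        if i > 0:
            out.append(sc)
        out.extend(transcripts[speaker])
    return out


# Toy overlapping utterance: speaker 2 starts talking before speaker 1 finishes.
frames = [
    ("hi", 1), ("hi", 1), BLANK, ("yes", 2),
    ("there", 1), BLANK, ("ok", 2), ("ok", 2),
]
per_speaker = collapse_joint_path(frames)
sot_target = serialize_sot(per_speaker, order=[1, 2])
print(per_speaker)
print(sot_target)
```

In this toy encoding, the same token emitted by two different speakers in adjacent frames is not merged, because the speaker label is part of the frame label. That is one plausible reading of how a joint labeling lets a model separate interleaved tokens by speaker without timestamps.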
