AS AI LG · Jul 14, 2025

ASR-Guided Speaker-Role Diarization and Diarization-Guided ASR Decoding

Arindam Ghosh, Mark Fuhs, Bongjun Kim, Anurag Chowdhury, Monika Woszczyna

arXiv:2507.17765v2 · 1.2 · h-index: 7

Originality: Incremental advance

AI Analysis

This work addresses the need for more informative speaker labeling in applications such as healthcare or media, though it is incremental as it builds on prior joint ASR+SD models.

The paper tackles the problem of speaker-role diarization (RD) in joint ASR systems, extending an existing framework to predict roles like doctor vs. patient instead of generic speakers, and achieves a 15% relative reduction in word error rate (WER) on a medical dataset.

From an application standpoint, speaker-role diarization (RD), such as doctor vs. patient or host vs. guest, is often more useful than traditional speaker diarization (SD), which assigns generic labels like speaker-1, speaker-2, etc. In the context of joint automatic speech recognition (ASR) + SD (who spoke what?), recent end-to-end models employ an auxiliary SD transducer, synchronized with the ASR transducer, to predict a speaker for each word. In this paper, we extend this framework to RD with three key contributions: (1) we simplify training by using forced alignment and a cross-entropy loss instead of the RNNT loss, (2) we show that word prediction and role prediction require different amounts of predictor context, which leads to separate task-specific predictors, unlike existing shared-predictor models, and (3) we propose a way to leverage RD posterior activity to influence ASR decoding and reduce small-word deletion errors.
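As a rough illustration of contribution (1), the sketch below computes a per-word cross-entropy loss over role labels once a forced alignment has assigned each word to a frame. It is a simplified sketch under stated assumptions, not the authors' implementation: the function names, the two-role label set, the toy logits, and the idea of reading role logits at a single aligned frame per word are all hypothetical choices made for the example.

```python
import math

# Hypothetical role inventory for the example.
ROLES = ["doctor", "patient"]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def role_cross_entropy(frame_logits, word_frames, word_roles):
    """Mean cross-entropy of role predictions, one term per word.

    frame_logits: per-frame role logits, shape [num_frames][num_roles]
    word_frames:  frame index assigned to each word by a forced alignment
                  (assumed to be computed elsewhere)
    word_roles:   reference role index for each word
    """
    total = 0.0
    for frame, role in zip(word_frames, word_roles):
        total -= log_softmax(frame_logits[frame])[role]
    return total / len(word_roles)

# Toy input: 4 frames, 2 words aligned to frames 0 and 2.
frame_logits = [[2.0, 0.0], [0.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
word_frames = [0, 2]
word_roles = [ROLES.index("doctor"), ROLES.index("patient")]

loss = role_cross_entropy(frame_logits, word_frames, word_roles)
print(f"role cross-entropy: {loss:.4f}")
```

Because the alignment fixes which frame each word belongs to, the loss is a plain sum over words and does not need to marginalize over alignment paths the way an RNNT loss does.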
