GLAD: Global-Local Aware Dynamic Mixture-of-Experts for Multi-Talker ASR
This work addresses speaker confusion in overlapping speech recognition, a problem central to applications such as meeting transcription. The contribution appears incremental, however, because it builds on existing Mixture-of-Experts (MoE) and SOT-based approaches.
The paper tackles the problem of accurately transcribing overlapping speech in multi-talker ASR. It proposes the GLAD architecture, which dynamically fuses speaker-aware global context with local acoustic details, and reports significant improvements over existing methods on the LibriSpeechMix and CH109 datasets.
End-to-end multi-talker automatic speech recognition (MTASR) faces significant challenges in accurately transcribing overlapping speech. A critical bottleneck is that speaker-specific acoustic characteristics, which are essential for distinguishing overlapping speech, are often diluted in deep network layers. To address this, we propose the Global-Local Aware Dynamic Mixture-of-Experts (GLAD) architecture. GLAD introduces a novel routing mechanism that dynamically fuses speaker-aware global context with fine-grained local acoustic details to adaptively guide expert selection. Experiments on the LibriSpeechMix and CH109 datasets demonstrate that GLAD significantly outperforms existing Serialized Output Training (SOT)-based MTASR approaches, exhibiting exceptional robustness in challenging, high-overlap scenarios. To the best of our knowledge, this is the first work to apply a global-local fusion MoE strategy to MTASR.
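The abstract does not spell out the router, so the following is only a minimal sketch of what a global-local fused routing step could look like, not the authors' implementation. It assumes the global context is a mean over frames (a stand-in for a speaker-aware summary), a scalar sigmoid gate that mixes global and local routing logits, and top-k expert selection. The function `global_local_route` and the weights `w_global`, `w_local`, and `w_gate` are hypothetical names introduced for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x dim) matrix by a length-dim vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def global_local_route(frames, w_global, w_local, w_gate, top_k=2):
    """Return, for each frame, a list of (expert_index, weight) pairs.

    Assumption: every design choice here is illustrative, not taken
    from the paper.
    """
    dim = len(frames[0])
    # Global context: mean over time (assumed stand-in for a
    # speaker-aware utterance-level summary).
    global_ctx = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    global_logits = matvec(w_global, global_ctx)

    routes = []
    for frame in frames:
        # Local routing logits from the current frame only.
        local_logits = matvec(w_local, frame)
        # Scalar gate in (0, 1) computed from [global; local] (assumed form).
        gate_in = global_ctx + frame
        score = sum(w * v for w, v in zip(w_gate, gate_in))
        alpha = 1.0 / (1.0 + math.exp(-score))
        # Dynamic fusion: per-frame convex mix of global and local logits.
        fused = [alpha * g + (1.0 - alpha) * l
                 for g, l in zip(global_logits, local_logits)]
        probs = softmax(fused)
        # Keep the top-k experts and renormalize their weights.
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
        norm = sum(probs[i] for i in top)
        routes.append([(i, probs[i] / norm) for i in top])
    return routes


# Toy usage with random features and random (untrained) router weights.
rng = random.Random(0)
num_experts, dim, num_frames = 4, 8, 5
frames = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_frames)]
w_global = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_experts)]
w_local = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_experts)]
w_gate = [rng.gauss(0, 0.1) for _ in range(2 * dim)]
routes = global_local_route(frames, w_global, w_local, w_gate, top_k=2)
```

In this sketch the gate value is recomputed per frame, so frames where local evidence is ambiguous (for example, heavily overlapped regions) could lean more on the utterance-level context, which is one plausible reading of "dynamically fuses" in the abstract.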