CL · AI · Feb 2

Mixture-of-Experts with Intermediate CTC Supervision for Accented Speech Recognition

Wonjun Lee, Hyounghun Kim, Gary Geunbae Lee

arXiv:2602.01967v1 · 0.6 · h-index: 3

Originality: Incremental advance

AI Analysis

This paper addresses the degradation of ASR performance on accented speech, covering both seen and unseen accents, and offers incremental improvements over existing methods.

The paper tackles accented speech recognition, where models trained on high-resource English varieties degrade on other accents. It introduces MoE-CTC, a Mixture-of-Experts architecture with intermediate CTC supervision, and reports up to 29.3% relative WER reduction over FastConformer baselines on the MCV-Accent benchmark.
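For readers unfamiliar with the metric, the sketch below shows how a relative WER reduction is conventionally computed: the absolute drop in WER divided by the baseline WER. The function name and the input numbers are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a percentage of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values (in percent), for illustration only.
example = relative_wer_reduction(baseline_wer=20.0, new_wer=15.0)
print(f"{example:.1f}% relative WER reduction")
```

Because the reduction is relative, the same percentage can correspond to very different absolute WER changes depending on how high the baseline is.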

Accented speech remains a persistent challenge for automatic speech recognition (ASR), as most models are trained on data dominated by a few high-resource English varieties, leading to substantial performance degradation for other accents. Accent-agnostic approaches improve robustness yet struggle with heavily accented or unseen varieties, while accent-specific methods rely on limited and often noisy labels. We introduce MoE-CTC, a Mixture-of-Experts architecture with intermediate CTC supervision that jointly promotes expert specialization and generalization. During training, accent-aware routing encourages experts to capture accent-specific patterns, which gradually transitions to label-free routing for inference. Each expert is equipped with its own CTC head to align routing with transcription quality, and a routing-augmented loss further stabilizes optimization. Experiments on the MCV-Accent benchmark demonstrate consistent gains across both seen and unseen accents in low- and high-resource conditions, achieving up to 29.3% relative WER reduction over strong FastConformer baselines.
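The abstract does not give the routing or loss formulas, so the following is only a rough sketch of one way the described ideas could fit together, not the authors' implementation. It assumes, hypothetically, one expert per accent label, a linear schedule `alpha` that blends label-based routing into learned routing, and an auxiliary term that weights per-expert CTC losses by the routing weights. All names (`routing_weights`, `alpha_schedule`, `combined_loss`, `lam`) are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def routing_weights(gate_logits, accent_ids, num_experts, alpha):
    """Blend accent-label routing with learned (label-free) routing.

    gate_logits: (batch, num_experts) scores from a learned gate.
    accent_ids:  (batch,) integer accent labels, or None at inference.
    alpha:       1.0 means purely label-based, 0.0 means purely learned.
    Assumption: accent label i maps to expert i.
    """
    learned = softmax(gate_logits)
    if accent_ids is None or alpha == 0.0:
        return learned
    label_based = np.eye(num_experts)[accent_ids]  # one-hot rows
    return alpha * label_based + (1.0 - alpha) * learned


def alpha_schedule(step, total_steps):
    """Hypothetical linear decay from accent-aware to label-free routing."""
    return max(0.0, 1.0 - step / total_steps)


def combined_loss(expert_ctc_losses, weights, final_ctc_loss, lam=0.3):
    """Final CTC loss plus routing-weighted per-expert CTC losses.

    expert_ctc_losses: (batch, num_experts) loss of each expert's CTC head.
    weights:           (batch, num_experts) routing weights.
    lam:               assumed weight of the auxiliary term.
    """
    intermediate = (weights * expert_ctc_losses).sum(axis=1).mean()
    return final_ctc_loss + lam * intermediate
```

In this sketch, passing `accent_ids=None` mimics inference, where no accent label is available and only the learned gate decides how experts are mixed.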
