CL · Jan 27

Dynamic Multi-Expert Projectors with Stabilized Routing for Multilingual Speech Recognition

arXiv:2601.19451v1
Originality: Incremental advance
AI Analysis

This work addresses multilingual ASR for Indic languages, offering an incremental improvement in recognition accuracy at comparable runtime efficiency.

The paper tackles multilingual speech recognition by proposing a stabilized Mixture-of-Experts projector to handle diverse acoustic-to-semantic mappings, achieving up to a 7.6% relative WER reduction over a single-projector baseline across four Indic languages.
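As a reminder of what "relative WER reduction" means, the short sketch below computes it from a baseline and an improved word error rate. The two WER values are hypothetical, chosen only so the arithmetic lands on 7.6%; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which WER drops relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent), NOT taken from the paper:
# a baseline of 20.00 and an improved system at 18.48.
reduction = relative_wer_reduction(20.0, 18.48)
print(f"relative WER reduction: {reduction:.1%}")
```

A relative reduction is measured against the baseline's own error rate, so the same 7.6% corresponds to a smaller absolute gain when the baseline WER is already low.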

Recent advances in LLM-based ASR connect frozen speech encoders with Large Language Models (LLMs) via lightweight projectors. While effective in monolingual settings, a single projector struggles to capture the diverse acoustic-to-semantic mappings required for multilingual ASR. To address this, we propose SMEAR-MoE, a stabilized Mixture-of-Experts projector that ensures dense gradient flow to all experts, preventing expert collapse while enabling cross-lingual sharing. We systematically compare monolithic, static multi-projector, and dynamic MoE designs across four Indic languages (Hindi, Marathi, Tamil, Telugu). Our SMEAR-MoE achieves strong performance, delivering up to a 7.6% relative WER reduction over the single-projector baseline while maintaining comparable runtime efficiency. Analysis of expert routing further shows linguistically meaningful specialization, with related languages sharing experts. These results demonstrate that stable multi-expert projectors are key to scalable and robust multilingual ASR.
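The abstract does not spell out how the projector is built, so the following is only a minimal NumPy sketch of one way a soft-merging MoE projector could work. It assumes the design follows the general soft-merging idea: a router produces a probability for every expert, the experts' parameters are averaged with those probabilities, and the single merged projector is applied to the encoder frames. The mean-pooled router input, the linear experts, and all names (`soft_merged_projector`, `router_w`, `expert_w`, `expert_b`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def soft_merged_projector(speech_feats, router_w, expert_w, expert_b):
    """Project encoder frames with a probability-weighted merge of experts.

    speech_feats: (T, d_in)        frames from a frozen speech encoder
    router_w:     (d_in, E)        router weights (assumed linear router)
    expert_w:     (E, d_in, d_out) one linear projector per expert
    expert_b:     (E, d_out)       one bias per expert
    Returns the projected frames (T, d_out) and the routing probabilities (E,).
    """
    # Assumption: route once per utterance from mean-pooled features.
    pooled = speech_feats.mean(axis=0)                # (d_in,)
    probs = softmax(pooled @ router_w)                # (E,), every entry > 0

    # Merge parameters instead of picking a top-k subset of experts, so each
    # expert contributes to the output and would receive a gradient.
    merged_w = np.tensordot(probs, expert_w, axes=1)  # (d_in, d_out)
    merged_b = probs @ expert_b                       # (d_out,)

    return speech_feats @ merged_w + merged_b, probs
```

Because every expert has a nonzero weight in the merge, no expert is cut off from the training signal, which is the property the abstract refers to as dense gradient flow. For purely linear experts, merging parameters gives the same output as a probability-weighted sum of the individual expert outputs, while needing only one projection per utterance.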
