Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR
This work addresses monolingual and multilingual speech recognition with a decoder-only architecture that outperforms strong encoder-decoder baselines while using fewer parameters, although the contribution is incremental in that it combines existing techniques such as MoE and Conformer.
The paper tackles automatic speech recognition by proposing a decoder-only Conformer model with modality-aware sparse mixtures of experts, which processes speech and text in a single stack without external encoders or pretrained LLMs. Relative to a 139M-parameter AED baseline, it reduces WER from 3.2% to 2.8% on LibriSpeech test-clean and reduces average WER from 12.2% to 10.6% on Common Voice 16.1 across five languages.
We present a decoder-only Conformer for automatic speech recognition (ASR) that processes speech and text in a single stack without external speech encoders or pretrained large language models (LLMs). The model uses a modality-aware sparse mixture of experts (MoE): disjoint expert pools for speech and text with hard routing and top-1 selection, embedded in hybrid-causality Conformer blocks (bidirectional for speech, causal for text). Training combines a CTC loss on speech positions with label-smoothed cross-entropy for text generation. Our 113M-parameter model consistently improves WER over a 139M-parameter attention-based encoder-decoder (AED) baseline on LibriSpeech (2.8% vs. 3.2% on test-clean; 5.6% vs. 6.0% on test-other). On Common Voice 16.1, with a single multilingual model covering five languages, our approach reduces average WER from 12.2% to 10.6%. To our knowledge, this is the first randomly initialized decoder-only ASR model that surpasses strong AED baselines via modality-aware routing and sparse MoE, achieving better accuracy with fewer active parameters and without alignment or adaptation modules.
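The two mechanisms named in the abstract, hybrid causality and modality-aware top-1 routing, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a sequence laid out as all speech positions followed by all text positions, assumes speech positions do not attend to text, and assumes a router that scores every expert, with the winner chosen only from the token's own modality pool. The function names `hybrid_causality_mask` and `route_top1` and the expert index layout are hypothetical.

```python
from typing import List, Sequence


def hybrid_causality_mask(num_speech: int, num_text: int) -> List[List[bool]]:
    """Build an attention mask; mask[i][j] is True if position i may attend to j.

    Assumed layout: positions [0, num_speech) are speech, the rest are text.
    """
    total = num_speech + num_text
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_speech:
                # Speech query: bidirectional over speech, no access to text
                # (the lack of access to text is an assumption of this sketch).
                mask[i][j] = j < num_speech
            else:
                # Text query: every speech position, plus text positions up to
                # and including itself (causal).
                mask[i][j] = j <= i
    return mask


def route_top1(
    is_speech: Sequence[bool],
    gate_logits: Sequence[Sequence[float]],
    num_speech_experts: int,
) -> List[int]:
    """Pick one expert per token from the pool that matches its modality.

    Assumed layout: experts [0, num_speech_experts) form the speech pool and
    the remaining experts form the text pool, so the pools are disjoint.
    """
    chosen = []
    for speech, logits in zip(is_speech, gate_logits):
        if speech:
            pool = range(0, num_speech_experts)
        else:
            pool = range(num_speech_experts, len(logits))
        # Hard routing: only experts in the token's own pool are candidates.
        # Top-1: keep the single highest-scoring candidate.
        chosen.append(max(pool, key=lambda e: logits[e]))
    return chosen


if __name__ == "__main__":
    for row in hybrid_causality_mask(num_speech=2, num_text=2):
        print(["x" if allowed else "." for allowed in row])
    # One speech token and one text token, four experts (two per pool).
    print(route_top1([True, False], [[0.1, 0.9, 5.0, 0.0], [9.0, 0.0, 0.2, 0.7]], 2))
```

In the routing example, the speech token has its largest logit on a text expert and the text token has its largest logit on a speech expert, yet each is still assigned within its own pool. That is the sense in which routing is hard with respect to modality in this sketch.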