CL · SD · AS · May 25, 2023

Mixture-of-Expert Conformer for Streaming Multilingual ASR

arXiv:2305.15663v1 · 30 citations
Originality: Highly original
AI Analysis

This paper addresses efficient multilingual speech recognition for on-device deployment. It is an incremental improvement that applies a novel method to a known bottleneck.

The paper tackles the high computation cost of large multilingual automatic speech recognition models for on-device applications by proposing a streaming multilingual Conformer with mixture-of-expert layers that activate only a subset of parameters, achieving an average 11.9% relative improvement in WER over the baseline on 12 languages.
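For reference, a relative WER improvement is conventionally measured against the baseline's WER. The formula below is that standard definition. The numbers in the comment are hypothetical, chosen only to show what an 11.9% relative gain looks like, and are not taken from the paper. How the paper averages across its 12 languages is not stated here.

```latex
\text{relative improvement}
  = \frac{\mathrm{WER}_{\text{baseline}} - \mathrm{WER}_{\text{proposed}}}
         {\mathrm{WER}_{\text{baseline}}}
% Hypothetical example: baseline 10.0% WER, proposed 8.81% WER
% (10.0 - 8.81) / 10.0 = 0.119, i.e. 11.9% relative
```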

End-to-end models with large capacity have significantly improved multilingual automatic speech recognition, but their computation cost poses challenges for on-device applications. We propose a streaming truly multilingual Conformer incorporating mixture-of-expert (MoE) layers that learn to activate only a subset of parameters in training and inference. The MoE layer consists of a softmax gate which chooses the best two experts among many in forward propagation. The proposed MoE layer offers efficient inference by activating a fixed number of parameters as the number of experts increases. We evaluate the proposed model on a set of 12 languages, and achieve an average 11.9% relative improvement in WER over the baseline. Compared to an adapter model using ground truth information, our MoE model achieves similar WER and activates a similar number of parameters, but without any language information. We further show around 3% relative WER improvement by multilingual shallow fusion.
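The top-2 softmax gating described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `top2_moe`, the two-layer ReLU experts, the renormalisation of the two gate weights, and all shapes are assumptions made for illustration. The sketch shows only that each frame is routed to two experts, so per-frame compute stays fixed as the number of experts grows.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def top2_moe(x, w_gate, w1, w2):
    """Hypothetical top-2 MoE feed-forward layer.

    x:      (T, D) frames
    w_gate: (D, E) gating weights
    w1:     (E, D, H) first projection of each expert (assumed architecture)
    w2:     (E, H, D) second projection of each expert (assumed architecture)
    """
    probs = softmax(x @ w_gate)                        # (T, E) gate probabilities
    top2 = np.argsort(-probs, axis=-1)[:, :2]          # (T, 2) best two experts per frame
    top2_p = np.take_along_axis(probs, top2, axis=-1)  # (T, 2) their gate values
    # Assumption: renormalise the two selected gate values to sum to 1.
    top2_p = top2_p / top2_p.sum(axis=-1, keepdims=True)

    out = np.zeros_like(x)
    for e in range(w_gate.shape[1]):
        for k in range(2):
            mask = top2[:, k] == e       # frames whose k-th choice is expert e
            if not mask.any():
                continue                 # unselected experts do no work
            h = np.maximum(x[mask] @ w1[e], 0.0)  # ReLU hidden layer (assumed)
            out[mask] += top2_p[mask, k][:, None] * (h @ w2[e])
    return out, top2

rng = np.random.default_rng(0)
T, D, H, E = 5, 8, 16, 4
x = rng.normal(size=(T, D))
out, chosen = top2_moe(
    x,
    rng.normal(size=(D, E)),
    rng.normal(size=(E, D, H)),
    rng.normal(size=(E, H, D)),
)
```

Raising `E` in this sketch adds parameters but leaves the work per frame unchanged, since each frame still passes through exactly two experts.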

