CL AI LG AS
Dec 10, 2021

Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition

arXiv:2112.05820v3
22 citations
Originality: Incremental advance
AI Analysis

This work aims to improve speech recognition accuracy in multi-lingual applications. It is an incremental advance: it applies an existing technique, the sparsely-gated mixture of experts, to two specific network architectures.

The paper scales up multi-lingual automatic speech recognition networks with a sparsely-gated mixture of experts, achieving relative word error rate reductions of 16.3% for the Sequence-to-Sequence Transformer and 4.6% for the Transformer Transducer.
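A relative word error rate reduction compares the new error rate against the baseline rather than reporting the absolute difference. The sketch below shows the arithmetic only. The function name and the WER values in the usage lines are hypothetical and chosen for illustration; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, NOT taken from the paper: a baseline WER of 10.0%
# lowered to 8.37% corresponds to a relative reduction of about 16.3%.
example = relative_wer_reduction(10.0, 8.37)
```

Under this definition, the same relative reduction can correspond to very different absolute gains depending on how strong the baseline already is.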

The sparsely-gated Mixture of Experts (MoE) can increase a network's capacity with little additional computational cost. In this work, we investigate how multi-lingual Automatic Speech Recognition (ASR) networks can be scaled up with a simple routing algorithm in order to achieve better accuracy. More specifically, we apply the sparsely-gated MoE technique to two types of networks: the Sequence-to-Sequence Transformer (S2S-T) and the Transformer Transducer (T-T). We demonstrate through a set of ASR experiments on multi-lingual data that the MoE networks reduce the word error rate by 16.3% and 4.6% relative with the S2S-T and T-T, respectively. Moreover, we thoroughly investigate the effect of the MoE on the T-T architecture under various conditions: streaming mode, non-streaming mode, the use of language ID, and the use of the MoE in the label decoder.
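To illustrate why sparse gating adds capacity at little extra compute, here is a minimal sketch of a generic top-k routed MoE layer in plain Python. It is not the paper's implementation. The class name SparseMoE, the single-matrix experts, the random initialisation, and the renormalisation of gate weights over the selected experts are all assumptions made for illustration; the abstract only says that a simple routing algorithm is used.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class SparseMoE:
    """Hypothetical sparsely-gated MoE layer (illustrative only)."""

    def __init__(self, dim, num_experts, top_k=1, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.top_k = top_k
        # Router: one score per expert for each input vector.
        self.router = rand_matrix(num_experts, dim)
        # Assumption: each expert is a single linear map; real experts
        # are typically feed-forward sub-networks.
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]

    def forward(self, x):
        gate_probs = softmax(matvec(self.router, x))
        # Keep only the top_k experts with the highest gate probability.
        chosen = sorted(range(len(gate_probs)),
                        key=lambda i: gate_probs[i],
                        reverse=True)[:self.top_k]
        # Assumption: renormalise gate weights over the chosen experts.
        norm = sum(gate_probs[i] for i in chosen)
        out = [0.0] * len(x)
        for i in chosen:
            weight = gate_probs[i] / norm
            expert_out = matvec(self.experts[i], x)  # only chosen experts run
            out = [o + weight * e for o, e in zip(out, expert_out)]
        return out, chosen


layer = SparseMoE(dim=4, num_experts=8, top_k=1)
output, used = layer.forward([0.5, -1.0, 0.25, 2.0])
```

The layer stores eight expert matrices but evaluates only top_k of them per input, so the parameter count grows with the number of experts while the per-input compute stays close to that of a single expert plus the small router.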

