CL AI Dec 15, 2022

Fixing MoE Over-Fitting on Low-Resource Languages in Multilingual Machine Translation

arXiv:2212.07571v1225 citations h-index: 21
Originality: Incremental advance
AI Analysis

The paper addresses performance degradation in low-resource language translation for multilingual NLP applications and represents an incremental improvement.

The paper tackles over-fitting in Mixture of Experts (MoE) models for low-resource multilingual machine translation. It proposes three regularization strategies: dropout for MoE layers, conditional MoE routing, and curriculum learning. Together these yield about +1 chrF++ on very low-resource language pairs.

Sparsely gated Mixture of Experts (MoE) models have been shown to be a compute-efficient method to scale model capacity for multilingual machine translation. However, for low-resource tasks, MoE models severely over-fit. We show effective regularization strategies, namely dropout techniques for MoE layers in EOM and FOM, Conditional MoE Routing and Curriculum Learning methods that prevent over-fitting and improve the performance of MoE models on low-resource tasks without adversely affecting high-resource tasks. On a massively multilingual machine translation benchmark, our strategies result in about +1 chrF++ improvement in very low resource language pairs. We perform an extensive analysis of the learned MoE routing to better understand the impact of our regularization methods and how we can improve them.
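To make the abstract's terms concrete, the following is a minimal sketch of a sparsely gated MoE layer for a single token, with a dropout-style mask applied to expert outputs during training. It is an illustration under stated assumptions, not the paper's implementation: the names `moe_forward`, `gate_rows`, and `mask_prob`, the toy experts, and the choice of top-2 routing are all hypothetical, and the exact EOM and FOM formulations are defined in the paper and are not reproduced here.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token, gate_rows, experts, top_k=2, mask_prob=0.0, rng=None):
    """Route one token through its top_k experts (illustrative sketch only).

    gate_rows: one weight vector per expert; the dot product with the
               token gives that expert's routing logit.
    mask_prob: hypothetical training-time probability of zeroing a
               selected expert's output (a dropout-style regularizer).
    """
    rng = rng or random.Random()
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_rows]
    probs = softmax(logits)
    # Sparse gating: keep only the top_k experts for this token.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(token)
    for i in chosen:
        # Randomly drop this expert's contribution (assumed masking scheme).
        if mask_prob > 0.0 and rng.random() < mask_prob:
            continue
        weight = probs[i] / norm
        expert_out = experts[i](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, chosen

# Toy experts (hypothetical): each one just scales the input vector.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
gate_rows = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]

output, routed_to = moe_forward([1.0, 0.0], gate_rows, experts)
masked_output, _ = moe_forward([1.0, 0.0], gate_rows, experts, mask_prob=1.0)
```

Only `top_k` experts run per token, which is why capacity can grow with the number of experts without a matching growth in per-token compute. With `mask_prob=1.0` every selected expert is dropped and the layer contributes nothing for that token, the extreme case of the regularizer sketched here.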
