CL · AI · LG · Dec 19, 2022

Memory-efficient NLLB-200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model

arXiv:2212.09811v3 · 238 citations · h-index: 15
Originality: Incremental advance
AI Analysis

This work addresses the memory cost of deploying large-scale multilingual models, making them more accessible. It is an incremental advance because it builds on the existing NLLB-200 architecture.

The authors address the high memory requirements of the NLLB-200 multilingual translation model with a pruning method that removes up to 80% of experts without further finetuning, enabling inference on a single 32GB GPU with negligible loss in translation quality.

The recently released NLLB-200 is a set of multilingual Neural Machine Translation models that cover 202 languages. The largest model is based on a Mixture of Experts architecture and achieves SoTA results across many language pairs. It contains 54.5B parameters and requires at least four 32GB GPUs just for inference. In this work, we propose a pruning method that enables the removal of up to 80% of experts without further finetuning and with a negligible loss in translation quality, which makes it feasible to run the model on a single 32GB GPU. Further analysis suggests that our pruning metrics can identify language-specific experts.
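The memory claim in the abstract can be checked with rough arithmetic. The sketch below is illustrative only and is not the authors' code. It assumes 16-bit weights and counts weight storage alone, ignoring activations and runtime overhead. The number of parameters held in experts (`EXPERT_PARAMS_ASSUMED`) is a hypothetical figure chosen for illustration, not one stated in the abstract. `keep_top_experts` ranks experts by a generic importance score, standing in for the paper's pruning metrics, which the abstract does not define.

```python
import math

TOTAL_PARAMS = 54.5e9          # from the abstract
EXPERT_PARAMS_ASSUMED = 50e9   # ASSUMPTION: illustrative share held in experts
BYTES_PER_PARAM = 2            # ASSUMPTION: 16-bit weights


def gpus_needed(num_params, gpu_mem_gb=32, bytes_per_param=BYTES_PER_PARAM):
    """Minimum number of GPUs needed to hold the weights alone."""
    weight_gb = num_params * bytes_per_param / 1e9
    return math.ceil(weight_gb / gpu_mem_gb)


def params_after_pruning(total_params, expert_params, prune_ratio):
    """Parameter count left after removing a fraction of expert parameters."""
    return total_params - prune_ratio * expert_params


def keep_top_experts(scores, prune_ratio):
    """Return the ids of the experts to keep, given per-expert scores.

    `scores` maps expert id -> importance (hypothetical metric); the
    lowest-scoring `prune_ratio` fraction of experts is dropped.
    """
    num_keep = max(1, round(len(scores) * (1 - prune_ratio)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:num_keep])


if __name__ == "__main__":
    print(gpus_needed(TOTAL_PARAMS))  # 4, matching "at least four 32GB GPUs"
    pruned = params_after_pruning(TOTAL_PARAMS, EXPERT_PARAMS_ASSUMED, 0.8)
    print(gpus_needed(pruned))
    print(keep_top_experts({0: 0.1, 1: 0.9, 2: 0.3, 3: 0.8, 4: 0.05}, 0.6))
```

Under these assumptions the unpruned weights occupy about 109 GB, which is why four 32GB GPUs are the minimum. Whether the pruned model fits on one GPU depends on the true expert share, which this sketch does not establish.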

