CL, LG
Feb 17, 2022

ST-MoE: Designing Stable and Transferable Sparse Expert Models

arXiv:2202.08906v2
396 citations
AI Analysis

This work addresses scalability and efficiency problems in large language models and enables broader use of sparse models in NLP, though it is incremental in that it builds on the existing MoE and Switch Transformer frameworks.

The paper tackles training instabilities and uncertain fine-tuning quality in sparse Mixture-of-Experts models. The result is ST-MoE-32B, a 269B-parameter sparse model that achieves state-of-the-art performance across diverse NLP tasks such as reasoning, summarization, and question answering.

Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).
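The gap between 269B stored parameters and a compute cost comparable to a 32B dense model comes from sparse routing: each token is processed by only a few experts. The sketch below illustrates that arithmetic for a single expert layer. It is not the paper's configuration: the class `MoELayer`, its fields, and every numeric value are hypothetical, chosen only to show how stored and per-token active parameter counts diverge.

```python
from dataclasses import dataclass


@dataclass
class MoELayer:
    """Hypothetical sparse feed-forward layer; all sizes are illustrative."""
    d_model: int      # hidden size (assumed value)
    d_ff: int         # inner size of each expert (assumed value)
    num_experts: int  # experts stored in the layer (assumed value)
    top_k: int        # experts each token is routed to (assumed value)

    def expert_params(self) -> int:
        # One expert = two weight matrices (d_model x d_ff, d_ff x d_model); biases ignored.
        return 2 * self.d_model * self.d_ff

    def router_params(self) -> int:
        # The router is modeled as a single d_model x num_experts projection.
        return self.d_model * self.num_experts

    def total_params(self) -> int:
        # Every expert is stored, whether or not a given token uses it.
        return self.num_experts * self.expert_params() + self.router_params()

    def active_params_per_token(self) -> int:
        # A token touches only its top_k experts, plus the router.
        return self.top_k * self.expert_params() + self.router_params()


layer = MoELayer(d_model=1024, d_ff=4096, num_experts=64, top_k=2)
ratio = layer.total_params() / layer.active_params_per_token()
print(f"stored parameters:    {layer.total_params():,}")
print(f"active per token:     {layer.active_params_per_token():,}")
print(f"stored / active:      {ratio:.1f}x")
```

Under these assumed sizes the layer stores far more parameters than any one token uses, which is the general sense in which a sparse model can be much larger than a dense model of similar per-token cost. Attention layers, embeddings, and any dense feed-forward layers are left out of this count.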

Code Implementations: 3 repos
