CL LG · Feb 17, 2022

ST-MoE: Designing Stable and Transferable Sparse Expert Models

Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, William Fedus

arXiv:2202.08906v2 · 23.6407 citations · Has Code

Originality: Incremental advance

AI Analysis

This work addresses training instability and uncertain fine-tuning quality in sparse language models, which could enable broader use of sparse models in NLP. It is an incremental advance because it builds on the existing MoE and Switch Transformer frameworks.

The paper tackles training instabilities and uncertain fine-tuning quality in sparse Mixture-of-Experts models and presents ST-MoE-32B, a 269B-parameter sparse model that achieves state-of-the-art performance across diverse NLP tasks, including reasoning, summarization, and question answering.

Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).
