Mar 14, 2022

Efficient Language Modeling with Sparse all-MLP

Meta AI
arXiv:2203.06850v3
16 citations
h-index: 38
Originality: Incremental advance
AI Analysis

This work addresses efficiency and expressiveness challenges in language modeling for NLP researchers and practitioners, offering an incremental improvement over existing all-MLP and Transformer methods.

The paper addresses the limited expressiveness of all-MLP architectures in language modeling by proposing sparsely activated MLPs with mixture-of-experts. The resulting models improve perplexity and train up to 2× more efficiently than Transformer-based MoEs, dense Transformers, and dense all-MLPs. They also outperform Transformer-based MoEs and dense Transformers in zero-shot in-context learning on six downstream tasks.

All-MLP architectures have attracted increasing interest as an alternative to attention-based models. In NLP, recent work like gMLP shows that all-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks. In this work, we analyze the limitations of MLPs in expressiveness, and propose sparsely activated MLPs with mixture-of-experts (MoEs) in both feature and input (token) dimensions. Such sparse all-MLPs significantly increase model capacity and expressiveness while keeping the compute constant. We address critical challenges in incorporating conditional computation with two routing strategies. The proposed sparse all-MLP improves language modeling perplexity and obtains up to 2× improvement in training efficiency compared to both Transformer-based MoEs (GShard, Switch Transformer, Base Layers and HASH Layers) as well as dense Transformers and all-MLPs. Finally, we evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
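
To make the idea of sparse activation concrete, here is a minimal sketch of generic token-wise top-1 expert routing over small MLP experts. It is not the paper's method: the abstract mentions two routing strategies and routing in the feature dimension, neither of which is specified here. All names (`make_expert`, `moe_forward`, `router`) and all sizes are hypothetical choices for illustration. The point it shows is that each token passes through exactly one expert, so per-token compute stays roughly constant as more experts are added.

```python
import math
import random


def make_expert(d_model, d_hidden, rng):
    """Return the weights (w1, w2) of one two-layer MLP expert."""
    w1 = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    return w1, w2


def matvec(x, w):
    """Compute x @ w for a vector x and a matrix w stored as a list of rows."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def expert_forward(x, expert):
    """Apply one MLP expert: linear, ReLU, linear."""
    w1, w2 = expert
    hidden = [max(0.0, h) for h in matvec(x, w1)]
    return matvec(hidden, w2)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(tokens, experts, router):
    """Top-1 routing (an assumption): each token is sent to a single expert."""
    outputs, choices = [], []
    for x in tokens:
        gates = softmax(matvec(x, router))  # one gate value per expert
        k = max(range(len(experts)), key=lambda i: gates[i])
        y = expert_forward(x, experts[k])  # only the chosen expert runs
        outputs.append([gates[k] * v for v in y])  # scale by the gate value
        choices.append(k)
    return outputs, choices


# Hypothetical toy sizes, chosen only so the example runs quickly.
rng = random.Random(0)
d_model, d_hidden, n_experts = 4, 8, 3
experts = [make_expert(d_model, d_hidden, rng) for _ in range(n_experts)]
router = [[rng.gauss(0, 1.0) for _ in range(n_experts)] for _ in range(d_model)]
tokens = [[rng.gauss(0, 1.0) for _ in range(d_model)] for _ in range(5)]

outputs, choices = moe_forward(tokens, experts, router)
print(choices)  # index of the expert chosen for each token
```

Raising `n_experts` in this sketch increases the total parameter count but leaves the work done per token unchanged apart from the small router product, which is the sense in which sparse models add capacity at near-constant compute.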

