LG, NE · Oct 16, 2023

Approximating Two-Layer Feedforward Networks for Efficient Transformers

arXiv:2310.10837v3 · 145 citations · h-index: 100
Originality: Incremental advance
AI Analysis

This work targets resource efficiency in language models and argues that MoEs are relevant at any model scale, not only for extremely large models. It appears incremental because it builds on existing MoE and PKM methods.

The paper tackles the problem of reducing the compute and memory requirements of neural networks without performance loss, specifically for Transformers. It introduces a framework for approximating two-layer feedforward networks and uses it to improve MoEs and PKMs. The resulting MoEs are competitive with a dense Transformer-XL on the WikiText-103 and enwiki8 datasets at two scales while being more resource-efficient.

How to reduce compute and memory requirements of neural networks (NNs) without sacrificing performance? Many recent works use sparse Mixtures of Experts (MoEs) to build resource-efficient large language models (LMs). Here we introduce several novel perspectives on MoEs, presenting a general framework that unifies various methods to approximate two-layer NNs (e.g., feedforward blocks of Transformers), including product-key memories (PKMs). Leveraging insights from this framework, we propose methods to improve both MoEs and PKMs. Unlike prior work that compares MoEs with dense baselines under the compute-equal condition, our evaluation condition is parameter-equal, which is crucial to properly evaluate LMs. We show that our MoEs are competitive with the dense Transformer-XL on both the WikiText-103 and enwiki8 datasets at two different scales, while being much more resource efficient. This demonstrates that MoEs are relevant not only to extremely large LMs but also to any-scale resource-efficient LMs. Our code is public.
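To make the idea of approximating a two-layer feedforward block concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the sigmoid selector `W_sel`, the top-k rule, and the absence of biases are all choices made here for brevity. The hidden units of one dense block are split into equal groups ("experts"), and only the `k` highest-scoring groups are evaluated, so the parameter count stays close to the dense block while the compute per input drops.

```python
import numpy as np

def dense_ffn(x, W1, W2):
    """Dense two-layer feedforward block: y = W2 @ relu(W1 @ x)."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def moe_ffn(x, W1, W2, W_sel, n_experts, k):
    """Sparse approximation of dense_ffn (hypothetical sketch).

    The d_ff hidden units are split into n_experts contiguous groups.
    A selector scores each group, and only the top-k groups are computed.
    """
    d_ff = W1.shape[0]
    assert d_ff % n_experts == 0, "hidden size must divide evenly into experts"
    group = d_ff // n_experts

    # Assumption: sigmoid scores, one per expert.
    scores = 1.0 / (1.0 + np.exp(-(W_sel @ x)))
    top = np.argsort(scores)[-k:]  # indices of the k highest-scoring experts

    y = np.zeros(W2.shape[0])
    for e in top:
        rows = slice(e * group, (e + 1) * group)
        h = np.maximum(W1[rows] @ x, 0.0)    # hidden units of expert e only
        y += scores[e] * (W2[:, rows] @ h)   # score-weighted contribution
    return y

# Toy sizes, chosen only for illustration.
d_model, d_ff, n_experts, k = 8, 32, 4, 2
rng = np.random.default_rng(0)
x = rng.standard_normal(d_model)
W1 = rng.standard_normal((d_ff, d_model))
W2 = rng.standard_normal((d_model, d_ff))
W_sel = rng.standard_normal((n_experts, d_model))

y_dense = dense_ffn(x, W1, W2)
y_sparse = moe_ffn(x, W1, W2, W_sel, n_experts, k)

# Parameter counts: the sparse variant adds only the small selector matrix.
dense_params = W1.size + W2.size
moe_params = dense_params + W_sel.size
print(dense_params, moe_params, y_dense.shape, y_sparse.shape)
```

In this sketch the sparse block touches `k / n_experts` of the hidden units per input while holding almost the same number of parameters as the dense block, which is the spirit of a parameter-equal comparison.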

Code Implementations: 2 repos
