LG, CL, NE | Dec 13, 2023

SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

arXiv:2312.07987v3 | 38 citations | h-index: 22 | NIPS
Originality: Highly original
AI Analysis

SwitchHead targets the compute and memory cost of self-attention in Transformer language models, which makes it relevant to researchers and practitioners who train or deploy such models under a fixed resource budget.

The authors tackle the cost of self-attention in Transformers with SwitchHead, a Mixture-of-Experts method for attention layers that reduces compute and memory usage while matching baseline performance. For a 262M-parameter model trained on C4, it matches the baseline's perplexity using only 44% of the compute and 27% of the memory, and it improves on BliMP by more than 3.5% absolute over a baseline given equal compute.

Despite many recent works on Mixture of Experts (MoEs) for resource-efficient Transformer language models, existing methods mostly focus on MoEs for feedforward layers. Previous attempts at extending MoE to the self-attention layer fail to match the performance of the parameter-matched baseline. Our novel SwitchHead is an effective MoE method for the attention layer that successfully reduces both the compute and memory requirements, achieving wall-clock speedup, while matching the language modeling performance of the baseline Transformer. Our novel MoE mechanism allows SwitchHead to compute up to 8 times fewer attention matrices than the standard Transformer. SwitchHead can also be combined with MoE feedforward layers, resulting in fully-MoE "SwitchAll" Transformers. For our 262M parameter model trained on C4, SwitchHead matches the perplexity of standard models with only 44% compute and 27% memory usage. Zero-shot experiments on downstream tasks confirm the performance of SwitchHead, e.g., achieving more than 3.5% absolute improvements on BliMP compared to the baseline with an equal compute resource.
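The abstract does not spell out the mechanism, so the following is only a minimal NumPy sketch of one way an MoE attention layer could look, not the authors' implementation. It assumes that each head keeps a single query and key projection (hence one attention matrix per head) while the value and output projections are chosen per token from a small pool of experts by a sigmoid router with top-k selection. Every name here (`moe_attention`, `topk_gate`, `init_params`, `R_v`, `R_o`) is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def topk_gate(scores, k):
    """Keep the k largest scores in each row and zero out the rest."""
    idx = np.argsort(scores, axis=-1)[:, -k:]
    mask = np.zeros_like(scores)
    np.put_along_axis(mask, idx, 1.0, axis=-1)
    return scores * mask

def init_params(rng, n_heads, n_experts, d_model, d_head):
    """Random weights for illustration only (assumed shapes)."""
    s = 1.0 / np.sqrt(d_model)
    return [{
        "W_q": rng.normal(0, s, (d_model, d_head)),
        "W_k": rng.normal(0, s, (d_model, d_head)),
        "W_v": rng.normal(0, s, (n_experts, d_model, d_head)),  # value experts
        "W_o": rng.normal(0, s, (n_experts, d_head, d_model)),  # output experts
        "R_v": rng.normal(0, s, (d_model, n_experts)),          # value router
        "R_o": rng.normal(0, s, (d_model, n_experts)),          # output router
    } for _ in range(n_heads)]

def moe_attention(x, params, k=2):
    """Causal self-attention with expert-selected value/output projections."""
    T = x.shape[0]
    causal = np.tril(np.ones((T, T), dtype=bool))
    out = np.zeros_like(x)
    for head in params:
        q, key = x @ head["W_q"], x @ head["W_k"]
        # One attention matrix per head, regardless of the number of experts.
        logits = np.where(causal, q @ key.T / np.sqrt(q.shape[-1]), -np.inf)
        attn = softmax(logits)
        # Gated mixture of value experts. For brevity this evaluates every
        # expert densely; a real implementation would compute only the
        # k selected experts per token to actually save compute.
        g_v = topk_gate(sigmoid(x @ head["R_v"]), k)
        v = np.einsum("te,edh,td->th", g_v, head["W_v"], x)
        ctx = attn @ v
        g_o = topk_gate(sigmoid(x @ head["R_o"]), k)
        out += np.einsum("te,ehd,th->td", g_o, head["W_o"], ctx)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))
params = init_params(rng, n_heads=2, n_experts=4, d_model=16, d_head=8)
y = moe_attention(x, params, k=2)
print(y.shape)  # (6, 16)
```

Under these assumptions, the number of attention matrices is set by the number of heads alone, so a layer with few heads and several experts per head computes fewer attention matrices than a dense layer with many heads. That is the kind of saving the abstract's "up to 8 times fewer attention matrices" refers to.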

Code Implementations: 2 repos
