CL · AI · Oct 11, 2023

Sparse Universal Transformer

arXiv:2310.07096v1141 citations · h-index: 25
Originality: Highly original
AI Analysis

This paper addresses the inefficient scaling of Universal Transformers, offering researchers and practitioners a more compute-efficient alternative that maintains performance.

The paper tackles the high computational and memory costs of scaling Universal Transformers by proposing the Sparse Universal Transformer, which combines Sparse Mixture of Experts with a new halting mechanism. On WMT'14 it matches baseline performance with half the computation and parameters, and it generalizes strongly on formal language tasks.
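As a rough illustration of the Sparse Mixture of Experts idea mentioned above, the following is a minimal sketch of generic top-k expert routing in plain Python. It is not taken from the paper: the function names (`softmax`, `top_k_route`, `sparse_moe`), the scalar "experts", and the choice to renormalise the gate weights over the selected experts are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k):
    """Select the k experts with the highest gate logits.

    Returns (expert_index, weight) pairs; the weights are a softmax
    over the selected logits only (an illustrative assumption).
    """
    order = sorted(range(len(gate_logits)),
                   key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([gate_logits[i] for i in top])
    return list(zip(top, weights))


def sparse_moe(x, experts, gate_logits, k=2):
    """Evaluate only the k routed experts and mix their outputs."""
    routes = top_k_route(gate_logits, k)
    return sum(w * experts[i](x) for i, w in routes)


# Toy usage: three hypothetical scalar "experts"; only two are evaluated.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 100 * x]
output = sparse_moe(2.0, experts, gate_logits=[0.0, 0.0, -5.0], k=2)
```

The point of the sketch is that the third expert is never called, so compute grows with k rather than with the total number of experts.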

The Universal Transformer (UT) is a variant of the Transformer that shares parameters across its layers. Empirical evidence shows that UTs have better compositional generalization than Vanilla Transformers (VTs) in formal language tasks. The parameter sharing also affords it better parameter efficiency than VTs. Despite its many advantages, scaling UT parameters is much more compute and memory intensive than scaling up a VT. This paper proposes the Sparse Universal Transformer (SUT), which leverages Sparse Mixture of Experts (SMoE) and a new stick-breaking-based dynamic halting mechanism to reduce UT's computational complexity while retaining its parameter efficiency and generalization ability. Experiments show that SUT achieves the same performance as strong baseline models while using only half the computation and parameters on WMT'14, and achieves strong generalization results on formal language tasks (Logical inference and CFQ). The new halting mechanism also enables around a 50% reduction in computation during inference with very little performance decrease on formal language tasks.
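To make the term "stick-breaking" concrete, here is a minimal sketch of the generic stick-breaking construction applied to per-layer halting. It is an illustration under stated assumptions, not the paper's implementation: the per-layer halting logits are taken as given inputs, and the threshold-based stopping rule in `layers_to_run` (including its default of 0.99) is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def stick_breaking_halting(logits):
    """Convert per-layer halting logits into halting probabilities.

    Each layer takes a sigmoid fraction of the probability mass that
    earlier layers left over, like breaking pieces off a stick.
    Returns (halt_probs, remaining), where halt_probs[l] is the
    probability of halting at layer l and remaining is the mass
    not yet assigned after the last layer.
    """
    halt_probs = []
    remaining = 1.0
    for logit in logits:
        fraction = sigmoid(logit)
        halt_probs.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return halt_probs, remaining


def layers_to_run(logits, threshold=0.99):
    """Hypothetical inference rule: stop at the first layer where the
    cumulative halting probability reaches the threshold."""
    halt_probs, _ = stick_breaking_halting(logits)
    cumulative = 0.0
    for layer, p in enumerate(halt_probs, start=1):
        cumulative += p
        if cumulative >= threshold:
            return layer
    return len(logits)


# Toy usage: equal logits of 0 halve the remaining mass at every layer.
probs, leftover = stick_breaking_halting([0.0, 0.0, 0.0])
```

Because the halting probabilities and the leftover mass always sum to one, a token whose early layers claim most of the stick can skip later layers, which is where inference-time savings of the kind described above would come from.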

Code Implementations (2 repos)
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
