Half the Nonlinearity Is Wasted: Measuring and Reallocating the Transformer's MLP Budget
This work targets efficiency and performance in large language models by reallocating MLP computation. The contribution is incremental in that it builds on existing transformer architectures.
The paper tackles unnecessary nonlinearity in transformer MLPs by introducing a gating mechanism that replaces the full MLP with a linear surrogate when possible. In GPT-2, the gate routes 25-56% of MLP computations to the linear path at less than 1% perplexity cost, and optimized linearization improves perplexity by up to 17.3%.
We investigate when transformer MLP nonlinearity is actually necessary. A gate with $d+1$ parameters decides when to replace the full MLP with a linear surrogate. Through systematic investigation across six models (162M-2.8B parameters), two architectures, and three corpora, we establish that the need for nonlinearity cannot be predicted from token identity: cross-corpus correlation is effectively zero ($r < 0.05$). The routing decision is fully contextual. Despite weak per-instance predictability, the gate exploits a heavily skewed distribution in which most MLP computations are near-linear, achieving 25-56% linear routing at <1% perplexity cost in GPT-2. In GPT-2 Large, 11 of 36 layers beat the baseline with gating, and no layer exceeds a 3.7% cost when routed entirely through the linear path. This success is architecture-dependent: Pythia models show higher costs, though Pythia-2.8B's full 32-layer sweep reveals one layer that narrowly beats the baseline. As a proof of concept, we progressively replace middle-layer MLPs with frozen linear matrices: 5 of 24 layers linearize at zero cost. With a full training budget, 4 linearized layers yield a 10.2% perplexity improvement, and a two-phase gated approach pushes this to 17.3%, beating a vanilla fine-tuning control and confirming that the nonlinear MLPs at these layers were actively harmful.
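To make the routing idea concrete, the following is a minimal sketch of a per-token gate with $d+1$ parameters (a weight vector of length $d$ plus a bias) choosing between a full MLP and a linear surrogate. It is an illustration under stated assumptions, not the paper's implementation: the gate's input (here, the MLP input vector), the hard threshold on the gate logit, the convention that a positive logit selects the linear path, and all names (`gated_mlp`, `make_params`, `A`, `c`, `w_gate`, `b_gate`) are hypothetical. Weights are random rather than trained.

```python
import math
import random


def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)))


def matvec(W, x, b):
    # W is a list of rows; returns W @ x + b as a list
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]


def full_mlp(x, W1, b1, W2, b2):
    # Standard two-layer MLP: W2 @ gelu(W1 @ x + b1) + b2
    hidden = [gelu(z) for z in matvec(W1, x, b1)]
    return matvec(W2, hidden, b2)


def gate_logit(x, w_gate, b_gate):
    # The gate itself: d weights plus one bias, i.e. d + 1 parameters
    return sum(w * xi for w, xi in zip(w_gate, x)) + b_gate


def gated_mlp(x, params, threshold=0.0):
    """Return (output, used_linear) for one token vector x.

    Assumption: a logit above `threshold` routes to the linear surrogate.
    """
    if gate_logit(x, params["w_gate"], params["b_gate"]) > threshold:
        return matvec(params["A"], x, params["c"]), True
    return full_mlp(x, params["W1"], params["b1"], params["W2"], params["b2"]), False


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def make_params(d, hidden, seed=0):
    # Random placeholder weights; in practice these would be trained or fitted
    rng = random.Random(seed)
    return {
        "W1": random_matrix(hidden, d, rng), "b1": [0.0] * hidden,
        "W2": random_matrix(d, hidden, rng), "b2": [0.0] * d,
        "A": random_matrix(d, d, rng), "c": [0.0] * d,  # linear surrogate
        "w_gate": [rng.gauss(0.0, 1.0) for _ in range(d)], "b_gate": 0.0,
    }


def gate_param_count(params):
    return len(params["w_gate"]) + 1


if __name__ == "__main__":
    d, hidden = 8, 32
    params = make_params(d, hidden)
    rng = random.Random(1)
    tokens = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(1000)]
    linear_fraction = sum(gated_mlp(x, params)[1] for x in tokens) / len(tokens)
    print("gate parameters:", gate_param_count(params))
    print("fraction routed to linear surrogate:", linear_fraction)
```

In this sketch the routing depends only on the current input vector, not on token identity, which matches the contextual character of the decision described above. With untrained random weights the routed fraction carries no meaning, and a trained gate would presumably be learned against a perplexity objective.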