LG AI NE May 10

Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers

arXiv:2605.09403 22.6
AI Analysis

For researchers studying Transformer internals, this paper shows that local FFN design choices have nonlocal effects on attention, challenging assumptions about modularity in small models.

The paper shows that architectural choices in Transformer FFN blocks (dense, GLU, MoE, MoE-GLU) reshape attention computations in small models, with sparse MoE routing shifting computation from FFN to attention, especially in carry-based addition. Frozen random routing nearly matches learned routing, indicating sparsity rather than specialization drives redistribution.

Architectural choices inside the Transformer feedforward network (FFN) block do not merely affect the block itself; they reshape the computations learned by the rest of the model. We study this effect in one-layer Transformers trained on digit addition with carry, modular arithmetic, and histogram counting. Comparing dense FFNs, gated linear units (GLUs), mixture-of-experts (MoE), and MoE-GLUs, we find that sparse MoE routing can shift computation from FFN to attention, with the strongest ablation-visible effect on carry-based addition. We decompose this redistribution into reduced per-token FFN capacity and sparse partitioning across experts. Critically, frozen random routing nearly matches learned routing, suggesting that redistribution is driven largely by architectural sparsity rather than router-learned specialization. As a secondary finding, GLU-style multiplicative gating rotates task-relevant Fourier structure out of the per-neuron basis and into distributed subspaces, making neuron-level interpretability less informative while preserving structured computation. We validate these conclusions with random-routing, narrow-FFN, and top-2 MoE controls, plus parameter-matching, activation-function, and width-scaling analyses. Together, these results show that local FFN design choices can have nonlocal consequences for Transformer computation.
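
To make the four FFN variants concrete, here is a minimal sketch in plain Python of a dense FFN, a GLU, and a top-1 MoE wrapper that can hold either kind of expert. It is an illustration under stated assumptions, not the paper's implementation: the class names, the ReLU activation, top-1 routing without gate weighting, and the choice to give each expert `d_hidden // n_experts` hidden units are all hypothetical. The `trainable_router` flag only marks the router as frozen; no training loop is shown.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Random Gaussian matrix scaled by 1/sqrt(cols)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def relu(v):
    return [max(0.0, a) for a in v]


class DenseFFN:
    """x -> W_out @ relu(W_in @ x). ReLU is an assumption."""

    def __init__(self, d_model, d_hidden, rng):
        self.w_in = rand_matrix(d_hidden, d_model, rng)
        self.w_out = rand_matrix(d_model, d_hidden, rng)

    def __call__(self, x):
        return matvec(self.w_out, relu(matvec(self.w_in, x)))


class GLUFFN:
    """x -> W_out @ (relu(W_gate @ x) * (W_up @ x)): multiplicative gating."""

    def __init__(self, d_model, d_hidden, rng):
        self.w_gate = rand_matrix(d_hidden, d_model, rng)
        self.w_up = rand_matrix(d_hidden, d_model, rng)
        self.w_out = rand_matrix(d_model, d_hidden, rng)

    def __call__(self, x):
        gate = relu(matvec(self.w_gate, x))
        up = matvec(self.w_up, x)
        return matvec(self.w_out, [g * u for g, u in zip(gate, up)])


class Top1MoE:
    """Send each token to the single highest-scoring expert (hypothetical top-1)."""

    def __init__(self, experts, d_model, rng, trainable_router=True):
        self.experts = experts
        self.router = rand_matrix(len(experts), d_model, rng)
        # Frozen random routing: the router keeps its random initialisation,
        # and an optimizer would be expected to skip it.
        self.trainable_router = trainable_router

    def route(self, x):
        scores = matvec(self.router, x)
        return max(range(len(scores)), key=scores.__getitem__)

    def __call__(self, x):
        # Only one expert runs per token, so each token sees a narrower FFN.
        return self.experts[self.route(x)](x)


rng = random.Random(0)
d_model, d_hidden, n_experts = 8, 32, 4
d_expert = d_hidden // n_experts  # assumed per-expert width

dense = DenseFFN(d_model, d_hidden, rng)
glu = GLUFFN(d_model, d_hidden, rng)
moe = Top1MoE([DenseFFN(d_model, d_expert, rng) for _ in range(n_experts)],
              d_model, rng, trainable_router=False)
moe_glu = Top1MoE([GLUFFN(d_model, d_expert, rng) for _ in range(n_experts)],
                  d_model, rng, trainable_router=False)

x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
outputs = {name: f(x) for name, f in
           [("dense", dense), ("glu", glu), ("moe", moe), ("moe_glu", moe_glu)]}
```

In this sketch a token routed through `moe` sees only `d_expert` hidden units, which corresponds to the reduced per-token capacity named in the abstract, while the split of tokens across experts corresponds to the sparse partitioning. A single `DenseFFN(d_model, d_expert, rng)` would be one way to approximate a narrow-FFN control under the same assumptions.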
