LG AI · May 12

The Routing and Filtering Structure of Attention

arXiv:2605.18826 · 18.7

Predicted impact: top 84% in LG · last 90 days
Originality: Incremental advance

AI Analysis

For practitioners of transformer-based models, this work provides a principled method to reduce attention complexity and parameter count by exploiting the spectral structure of routing, though the gains are incremental over existing linear attention methods.

The paper decomposes attention into routing and filtering components, showing that routing operates at low rank and self-organizes into a spectral cascade. This allows early layers of $S$-$D$ attention to be linearized at small perplexity cost (under 5% for the first seven layers of a 125M-parameter model) and allows attention parameters to be cut by 47% to 65% at a perplexity increase of 3.9% to 8.4%.
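Linearizing a layer here means replacing softmax attention with a kernel feature map, so that the cost grows linearly with sequence length. The abstract names ELU+1 linear attention as the replacement used for early layers; the block below is a minimal NumPy sketch of that generic technique in its causal form, not the authors' implementation. The function names `elu_plus_one` and `causal_linear_attention` and the toy shapes are assumptions made for illustration.

```python
import numpy as np

def elu_plus_one(x):
    # Feature map phi(x) = elu(x) + 1: x + 1 for x > 0, exp(x) otherwise.
    # It is strictly positive, so the normalizer below never vanishes.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def causal_linear_attention(Q, K, V):
    """Causal linear attention for one head.

    Q, K: (T, d) queries and keys; V: (T, d_v) values. Returns (T, d_v).
    Position i attends only to positions j <= i.
    """
    phi_q, phi_k = elu_plus_one(Q), elu_plus_one(K)
    # Running sums over the prefix j <= i replace the T x T score matrix.
    kv = np.cumsum(phi_k[:, :, None] * V[:, None, :], axis=0)  # (T, d, d_v)
    z = np.cumsum(phi_k, axis=0)                               # (T, d)
    num = np.einsum("td,tdv->tv", phi_q, kv)
    den = np.einsum("td,td->t", phi_q, z)
    return num / den[:, None]

# Toy usage with random inputs (shapes are illustrative only).
rng = np.random.default_rng(0)
Q = rng.standard_normal((6, 4))
K = rng.standard_normal((6, 4))
V = rng.standard_normal((6, 3))
out = causal_linear_attention(Q, K, V)
```

The cumulative sums are kept explicit for readability; a practical implementation would carry the running state forward step by step rather than materializing a `(T, d, d_v)` array.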

The attention interaction matrix $QK^{\top}$ contains two entangled computations: a skew-symmetric component that redistributes information between positions (routing) and a symmetric component that scales mutual relevance (filtering). We decompose 1776 heads across five pretrained transformers and find routing operating at low rank, well below the routing capacity allocated by the weight kernel. We introduce $S$-$D$ attention as a diagnostic parameterization that disentangles routing from filtering by construction with guaranteed stability ($\mathrm{Re}(\lambda) \le 0$) and trains stably without layer normalization. When disentangled and unnormalized, routing self-organizes into a spectral cascade, effective rank $2$ at the first layer, expanding with depth across six scales from 7M to 355M parameters. The cascade predicts where attention can be simplified: linearizing the first seven layers of 125M $S$-$D$ attention costs ${<}5\%$ perplexity, whereas standard attention collapses under the same intervention. The linearizable region widens with depth. Replacing the first four layers with ELU+1 linear attention reaches within $1.4\%$ of baseline at full head dimension. Cascade-allocated architectures trade attention parameters for perplexity ($47\%$ to $65\%$ fewer attention parameters at $+3.9\%$ to $+8.4\%$ PPL). The routing-filtering decomposition makes the spectral budget legible; the cascade makes it actionable.
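The split into a skew-symmetric (routing) part and a symmetric (filtering) part can be illustrated on any square matrix. The sketch below assumes the decomposition is applied to a per-head kernel $M = W_Q W_K^{\top}$ built from hypothetical projection matrices `W_q` and `W_k`. It also uses an entropy-based effective rank (the exponential of the Shannon entropy of the normalized singular values) as one common definition, which may differ from the measure used in the paper.

```python
import numpy as np

def split_routing_filtering(W_q, W_k):
    """Split M = W_q @ W_k.T into skew-symmetric and symmetric parts.

    W_q, W_k: (d_model, d_head) projection matrices (hypothetical names).
    Returns (routing, filtering), both of shape (d_model, d_model).
    """
    M = W_q @ W_k.T
    filtering = 0.5 * (M + M.T)  # symmetric part: filtering.T == filtering
    routing = 0.5 * (M - M.T)    # skew-symmetric part: routing.T == -routing
    return routing, filtering

def effective_rank(A, eps=1e-12):
    """Entropy-based effective rank of A (an assumed definition)."""
    s = np.linalg.svd(A, compute_uv=False)
    total = s.sum()
    if total <= eps:
        return 0.0
    p = s / total
    p = p[p > eps]  # drop numerically zero singular values
    return float(np.exp(-(p * np.log(p)).sum()))

# Toy usage with random weights; real heads would be loaded from a checkpoint.
rng = np.random.default_rng(0)
d_model, d_head = 64, 16
W_q = rng.standard_normal((d_model, d_head))
W_k = rng.standard_normal((d_model, d_head))
routing, filtering = split_routing_filtering(W_q, W_k)
routing_rank = effective_rank(routing)
filtering_rank = effective_rank(filtering)
```

The two parts always sum back to the original kernel, so the split loses no information. Random weights like these say nothing about the low-rank routing reported for trained heads.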
