LG · AI · Apr 18

Adaptive Computation Depth via Learned Token Routing in Transformers

arXiv:2605.052221.2

Predicted impact: top 99% in LG · last 90 days · Originality: Incremental advance

AI Analysis

For transformer practitioners, TSA provides a lightweight, end-to-end differentiable method to reduce computation adaptively without architectural changes, offering practical speedups.

Token-Selective Attention (TSA) learns per-token gates to skip transformer layers adaptively, saving 14-23% of token-layer operations on character-level language modeling with <0.5% quality loss, and achieving 0.7% lower validation loss than early exit at matched efficiency.

Standard transformer architectures apply the same number of layers to every token regardless of contextual difficulty. We present Token-Selective Attention (TSA), a learned per-token gate on residual updates between consecutive transformer blocks. Each gate is a lightweight two-layer multi-layer perceptron (MLP) that produces a continuous halting probability, making the mechanism end-to-end differentiable with 1.7% parameter overhead and no changes to the base architecture. Notably, TSA learns difficulty-proportional routing without any explicit depth pressure: even at $\lambda = 0$ (no depth regularisation), the task-loss gradient alone drives the router to skip 20% of token-layer operations. On character-level language modeling, TSA saved 14-23% of token-layer operations (TLOps) across Tiny-Shakespeare and enwik8 at <0.5% quality loss. At matched efficiency, TSA achieved 0.7% lower validation loss than early exit, and the learned routing transfers directly to inference-time sparse execution for real wall-clock speedup.
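To make the gating idea concrete, here is a minimal pure-Python sketch of a per-token gate on residual updates. It is not the authors' implementation. The abstract does not give the exact gate equation, so the names `TokenGate`, `toy_block`, and `forward`, the rule that scales the update by `1 - p_halt`, and the skip threshold are all illustrative assumptions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TokenGate:
    """Hypothetical two-layer MLP mapping one token's hidden state to a
    halting probability in (0, 1). Layer sizes and init are assumptions."""

    def __init__(self, d_model, d_hidden, rng):
        self.W1 = [[rng.gauss(0, 0.5) for _ in range(d_model)]
                   for _ in range(d_hidden)]
        self.b1 = [0.0] * d_hidden
        self.w2 = [rng.gauss(0, 0.5) for _ in range(d_hidden)]
        self.b2 = 0.0

    def __call__(self, h):
        # First layer with ReLU, then a scalar logit squashed by a sigmoid.
        hidden = [max(0.0, sum(w * x for w, x in zip(row, h)) + b)
                  for row, b in zip(self.W1, self.b1)]
        logit = sum(w * z for w, z in zip(self.w2, hidden)) + self.b2
        return sigmoid(logit)


def toy_block(h):
    """Stand-in for a transformer block's residual update (assumption)."""
    return [math.tanh(x) for x in h]


def forward(tokens, gates, block, threshold=None):
    """Apply one gated residual update per layer to every token.

    threshold=None: soft gating, every block runs (training-style).
    threshold=t:    a token skips a layer when its halting probability
                    exceeds t (inference-style sparse execution).
    Returns the outputs and the fraction of token-layer operations executed.
    """
    executed = 0
    outputs = []
    for h in tokens:
        for gate in gates:
            p_halt = gate(h)
            if threshold is not None and p_halt > threshold:
                continue  # skip this token-layer operation entirely
            update = block(h)
            # Assumed rule: scale the residual update by (1 - p_halt).
            h = [x + (1.0 - p_halt) * u for x, u in zip(h, update)]
            executed += 1
        outputs.append(h)
    return outputs, executed / (len(tokens) * len(gates))


rng = random.Random(0)
gates = [TokenGate(d_model=4, d_hidden=8, rng=rng) for _ in range(3)]
tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
_, executed_fraction = forward(tokens, gates, toy_block, threshold=0.5)
print(f"token-layer operations executed: {executed_fraction:.0%}")
```

With `threshold=None` every block runs and the gate only rescales the update, which keeps the computation differentiable in a real autograd framework. Setting a threshold turns the same gate values into hard skip decisions, which is where the saving in token-layer operations would come from.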
