What Layers When: Learning to Skip Compute in LLMs with Residual Gates
This work targets the high inference cost of large language models and is an incremental improvement over existing early-exit and router-based approaches.
The paper reduces inference cost in large language models with GateSkip, a residual-stream gating mechanism that enables token-wise layer skipping. On long-form reasoning tasks, it saves up to 15% compute while retaining over 90% of baseline accuracy. On instruction-tuned models, it improves accuracy at full compute and matches baseline quality near 50% savings.
We introduce GateSkip, a simple residual-stream gating mechanism that enables token-wise layer skipping in decoder-only LMs. Each Attention/MLP branch is equipped with a sigmoid-linear gate that condenses the branch's output before it re-enters the residual stream. During inference we rank tokens by the gate values and skip low-importance ones using a per-layer budget. While early-exit or router-based Mixture-of-Depths models are known to be unstable and need extensive retraining, our smooth, differentiable gates fine-tune stably on top of pretrained models. On long-form reasoning, we save up to 15% compute while retaining over 90% of baseline accuracy. For increasingly larger models, this tradeoff improves drastically. On instruction-tuned models we see accuracy gains at full compute and match baseline quality near 50% savings. The learned gates give insight into transformer information flow (e.g., BOS tokens act as anchors), and the method combines easily with quantization, pruning, and self-speculative decoding.
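The gating and budgeted ranking described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the gate is computed from the branch output, it is vector-valued (one value per hidden dimension), a token's importance is the mean of its gate values, and the per-layer budget is the fraction of tokens kept. The function names (`gate_values`, `gated_residual`, `token_score`, `select_tokens`) are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gate_values(branch_out: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Sigmoid-linear gate: g = sigmoid(W @ o + b).

    Assumption: the gate reads the branch output `o` and yields one value
    per hidden dimension.
    """
    return [
        sigmoid(sum(w * o for w, o in zip(row, branch_out)) + b)
        for row, b in zip(weight, bias)
    ]


def gated_residual(hidden: Vector, branch_out: Vector, gate: Vector) -> Vector:
    """Scale the branch output by the gate before adding it to the residual stream."""
    return [h + g * o for h, g, o in zip(hidden, gate, branch_out)]


def token_score(gate: Vector) -> float:
    """Assumed importance score for a token: the mean of its gate values."""
    return sum(gate) / len(gate)


def select_tokens(scores: List[float], budget: float) -> List[bool]:
    """Rank tokens by score and keep the top ceil(budget * n) of them.

    Returns one flag per token: True means the token is processed by the
    layer, False means it is skipped. At least one token is always kept.
    """
    n = len(scores)
    keep = max(1, math.ceil(budget * n))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:keep])
    return [i in kept for i in range(n)]


if __name__ == "__main__":
    # Toy example: four tokens with precomputed importance scores, 50% budget.
    scores = [0.9, 0.1, 0.5, 0.7]
    print(select_tokens(scores, budget=0.5))
```

Because the sigmoid is smooth, a gate of this form stays differentiable during fine-tuning; the hard top-k selection in `select_tokens` would only apply at inference time.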