CL | AI | Mar 12

Why Attend to Everything? Focus is the Key

Hengshuai Yao, Xing Chen, Ahmed Murtadha, Jin Li, Shuai Shao, Yasin Abbasi Yadkori, Guan Wang, Mingli Yuan, William Chen, Sen Song

arXiv:2604.03260 | 82.3 | h-index: 5

AI Analysis

This work addresses the computational bottleneck of attention in large-scale transformer models. For AI researchers and practitioners, it offers a retrofit solution that is incremental but provides significant efficiency and performance improvements.

The paper tackles inefficient attention in transformers by introducing Focus, a method that uses learnable centroids to restrict attention to relevant token pairs. It improves perplexity across model sizes and architectures without degrading downstream performance. Specific gains include surpassing full attention at 124M parameters (30.3 vs 31.4 PPL) and up to 8.6x speedup at inference.

We introduce Focus, a method that learns which token pairs matter rather than approximating all of them. Learnable centroids assign tokens to groups; distant attention is restricted to same-group pairs while local attention operates at full resolution. Because all model weights stay frozen, Focus is purely additive: centroid-only training (as few as 148K parameters) improves domain perplexity with zero degradation on downstream benchmarks--from 124M to 70B parameters, across five attention architectures. No existing efficient attention method achieves this in the retrofit setting. At 124M, Focus surpasses full attention (30.3 vs 31.4 PPL); trained from scratch at 7B scale (2B tokens), Focus again beats full attention (13.82 vs 13.89 PPL). At inference, restricting each token to its top-k highest-scoring groups discretizes the soft routing into a hard sparsity pattern, yielding 2x speedup while beating the pretrained baseline (41.3 vs 42.8 PPL); decomposing this pattern into two standard FlashAttention calls reaches 8.6x wall-clock speedup at 1M tokens with no custom kernels. Unlike LoRA, centroid routing preserves alignment: instruction-tuned models retain TruthfulQA scores after adaptation, while LoRA degrades at every learning rate and rank. Sinkhorn normalization enforces balanced groups as a hard constraint, and the resulting groups discover interpretable linguistic categories without supervision.
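The routing described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The dot-product scoring, the causal mask, the `window` parameter, and all function names (`group_scores`, `sinkhorn`, `route_top_k`, `focus_mask`) are assumptions made for illustration. The abstract only states that centroids assign tokens to groups, that Sinkhorn normalization balances the groups, that top-k routing yields a hard sparsity pattern, and that local attention stays at full resolution.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def group_scores(tokens, centroids):
    """Score each token against each centroid (dot product is an assumption)."""
    return [[dot(t, c) for c in centroids] for t in tokens]

def sinkhorn(scores, n_iters=50):
    """Standard Sinkhorn iteration on exp(scores): alternately rescale columns
    so every group receives mass n_tokens / n_groups, and rows so every
    token's weights sum to 1."""
    n, g = len(scores), len(scores[0])
    P = [[math.exp(s) for s in row] for row in scores]
    for _ in range(n_iters):
        for c in range(g):                      # balance groups (columns)
            col = sum(P[r][c] for r in range(n))
            for r in range(n):
                P[r][c] *= (n / g) / col
        for r in range(n):                      # normalize tokens (rows)
            row = sum(P[r])
            P[r] = [p / row for p in P[r]]
    return P

def route_top_k(assign, k=1):
    """Keep each token's k highest-weight groups as a set of group indices."""
    routes = []
    for row in assign:
        ranked = sorted(range(len(row)), key=lambda grp: row[grp], reverse=True)
        routes.append(set(ranked[:k]))
    return routes

def focus_mask(routes, window):
    """Causal boolean mask (causality is an assumption): query i may attend
    to key j <= i if j lies within `window` positions (local, full
    resolution) or shares at least one group with i (distant, same group)."""
    n = len(routes)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            local = (i - j) < window
            same_group = bool(routes[i] & routes[j])
            mask[i][j] = local or same_group
    return mask

# Toy data: six 2-d token vectors and two centroids (hypothetical values).
tokens = [[1, 0], [0, 1], [1, 0.1], [0, 1], [0.9, 0], [0.1, 1]]
centroids = [[1, 0], [0, 1]]

balanced = sinkhorn(group_scores(tokens, centroids))
routes = route_top_k(balanced, k=1)
mask = focus_mask(routes, window=2)
```

In this toy setup, a distant query-key pair is kept only when both tokens route to the same group, while neighbours inside the window are always kept. A mask of this shape could be passed to any attention implementation that accepts a boolean mask. The two-call FlashAttention decomposition mentioned in the abstract is not reproduced here.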
