Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding
The work targets the inference cost of deploying large transformer models in natural language processing, and offers a practical, incremental improvement over existing sparsification techniques.
The paper aims to reduce the computational and memory costs of transformer inference. It introduces Top-Theta Attention, a training-free method that sparsifies attention using content-based thresholds, and reports a 3-10x reduction in V-cache usage and up to 10x fewer attention elements with no more than 1% accuracy degradation.
We present Top-Theta (Top-$θ$) Attention, a training-free method for sparsifying transformer attention during inference. Our key insight is that static, per-head thresholds can be calibrated to retain a desired constant number of significant elements per attention row. This approach enables content-based sparsity without retraining, and it remains robust across data domains. We further introduce compensation techniques to preserve accuracy under aggressive sparsification, establishing attention thresholding as a practical and principled alternative to top-k attention. We provide an extensive evaluation on natural language processing tasks, showing that Top-$θ$ achieves a 3-10x reduction in V-cache usage and up to 10x fewer attention elements during inference while degrading accuracy by no more than 1%.
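To make the thresholding idea concrete, the following is a minimal sketch of calibrating one static threshold for a single head and then applying it to an attention row. It is not the paper's implementation. The function names (`calibrate_threshold`, `top_theta_row`), the choice to calibrate as the mean of the k-th largest score over calibration rows, the choice to threshold raw scores before the softmax, and the fallback for rows where nothing passes the threshold are all illustrative assumptions. The compensation techniques mentioned above are not shown.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def calibrate_threshold(calibration_rows, k):
    """Estimate a static threshold for one head (hypothetical procedure).

    Assumption: the threshold is the mean, over calibration rows, of the
    k-th largest attention score in each row, so that roughly k elements
    per row survive at inference time.
    """
    kth_values = [
        sorted(row, reverse=True)[k - 1]
        for row in calibration_rows
        if len(row) >= k
    ]
    if not kth_values:
        raise ValueError("no calibration row has at least k elements")
    return sum(kth_values) / len(kth_values)


def top_theta_row(scores, theta):
    """Keep only scores >= theta and normalize over the survivors.

    Returns (kept_indices, probabilities). Unlike top-k, this needs a
    single comparison per element rather than a per-row sort or selection.
    """
    kept = [i for i, s in enumerate(scores) if s >= theta]
    if not kept:
        # Assumed fallback: always keep the single largest element.
        kept = [max(range(len(scores)), key=lambda i: scores[i])]
    probs = softmax([scores[i] for i in kept])
    return kept, probs


# Tiny worked example with hand-checkable numbers.
theta = calibrate_threshold([[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]], k=2)
kept, probs = top_theta_row([0.0, 4.0, 3.0, 5.0], theta)
# theta is 3.5, so only indices 1 and 3 (scores 4.0 and 5.0) are kept.
```

Only the value-cache rows at the kept indices would then be needed for the weighted sum, which is where a saving in V-cache reads would come from in a scheme like this.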