Sparse Attention Post-Training for Mechanistic Interpretability
This work addresses the challenge of interpreting large language models: for researchers and practitioners, it makes attention patterns more organized and exposes redundant computation.
The paper tackles the problem of making transformer attention sparse without performance loss. On models up to 7B parameters, it reduces attention connectivity to approximately 0.4% of its edges while retaining the original pretraining loss. It also shows that this sparsity simplifies task-specific circuits, which need up to 100x fewer edges.
We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 7B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.4 \%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer edges connecting them. Additionally, using cross-layer transcoders, we show that sparse attention substantially simplifies attention attribution, enabling a unified view of feature-based and circuit-based perspectives. These results demonstrate that transformer attention can be made orders of magnitude sparser, suggesting that much of its computation is redundant and that sparsity may serve as a guiding principle for more structured and interpretable models.
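To make the notions of "attention edges" and a "constrained-loss objective" concrete, the following is a minimal sketch and not the method's actual implementation. It assumes that an edge is a causal (query, key) pair whose attention weight exceeds a small threshold, and that the loss constraint is enforced with a Lagrange multiplier updated by dual ascent. The abstract specifies neither choice, and all function names, the threshold, and the step size are hypothetical.

```python
import numpy as np


def causal_edge_fraction(attn, threshold=1e-3):
    """Fraction of causal (query, key) pairs with attention weight above `threshold`.

    attn: array of shape (heads, seq, seq); each row is a softmax-normalised
    attention distribution. The threshold is an illustrative assumption.
    """
    heads, seq, _ = attn.shape
    # Causal mask: a query at position i may attend to keys at positions <= i.
    causal = np.tril(np.ones((seq, seq), dtype=bool))
    active = (attn > threshold) & causal
    return float(active.sum() / (heads * causal.sum()))


def constrained_objective(task_loss, sparsity_penalty, target_loss, lam):
    """Assumed Lagrangian: minimise the sparsity penalty subject to
    task_loss <= target_loss, with multiplier lam >= 0."""
    return sparsity_penalty + lam * (task_loss - target_loss)


def update_multiplier(lam, task_loss, target_loss, step=0.1):
    """Dual ascent: increase lam while the loss constraint is violated,
    decrease it otherwise, and clip at zero."""
    return max(0.0, lam + step * (task_loss - target_loss))


if __name__ == "__main__":
    seq, heads = 4, 2
    # Dense baseline: every query spreads its weight uniformly over its causal keys.
    dense = np.tril(np.ones((seq, seq)))
    dense = dense / dense.sum(axis=-1, keepdims=True)
    dense = np.stack([dense] * heads)
    # Sparse extreme: every query attends only to itself.
    sparse = np.stack([np.eye(seq)] * heads)
    print(causal_edge_fraction(dense), causal_edge_fraction(sparse))
```

Under these assumptions, the multiplier grows whenever the task loss exceeds its target, which pushes optimisation back toward the original pretraining loss before further sparsity is pursued.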