Interpreting vision transformers via residual replacement model
The work provides interpretability tools for vision transformer researchers, though the contribution appears incremental because it builds on existing sparse autoencoder and circuit analysis methods.
The paper asks how vision transformers (ViTs) represent and process information. It analyzes 6.6K features across all layers and introduces a residual replacement model that simplifies ViT computations by expressing them in terms of interpretable features. The result is a scalable framework that produces faithful circuits small enough for human-scale interpretability, and the authors demonstrate its utility in debiasing spurious correlations.
How do vision transformers (ViTs) represent and process the world? This paper addresses this long-standing question through the first systematic analysis of 6.6K features across all layers, extracted via sparse autoencoders, and by introducing the residual replacement model, which replaces ViT computations with interpretable features in the residual stream. Our analysis reveals not only a feature evolution from low-level patterns to high-level semantics, but also how ViTs encode curves and spatial positions through specialized feature types. The residual replacement model scalably produces a faithful yet parsimonious circuit for human-scale interpretability by significantly simplifying the original computations. As a result, this framework enables intuitive understanding of ViT mechanisms. Finally, we demonstrate the utility of our framework in debiasing spurious correlations.
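To make the core idea concrete, the following is a minimal sketch of replacing a residual-stream activation with sparse autoencoder features. It assumes a standard ReLU sparse autoencoder with a decoder bias; the abstract does not specify the architecture, training objective, or dimensions, so the function names (`sae_encode`, `sae_decode`, `replace_residual`), the random untrained weights, and the toy sizes are all hypothetical and for illustration only.

```python
import random

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def sae_encode(x, W_enc, b_enc, b_dec):
    """Map a residual-stream vector to non-negative feature activations."""
    centered = [a - b for a, b in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(W_enc, centered), b_enc)]
    return [max(0.0, p) for p in pre]  # ReLU keeps activations non-negative

def sae_decode(f, W_dec, b_dec):
    """Reconstruct the residual-stream vector from feature activations."""
    return [r + b for r, b in zip(matvec(W_dec, f), b_dec)]

def replace_residual(x, W_enc, b_enc, W_dec, b_dec):
    """Return features, the reconstruction, and the unexplained error term."""
    f = sae_encode(x, W_enc, b_enc, b_dec)
    x_hat = sae_decode(f, W_dec, b_dec)
    error = [a - b for a, b in zip(x, x_hat)]
    return f, x_hat, error

# Toy sizes (assumed): 4-dimensional residual stream, 8 dictionary features.
random.seed(0)
d_model, n_features = 4, 8
W_enc = [[random.gauss(0, 0.5) for _ in range(d_model)] for _ in range(n_features)]
W_dec = [[random.gauss(0, 0.5) for _ in range(n_features)] for _ in range(d_model)]
b_enc = [0.0] * n_features
b_dec = [0.0] * d_model

x = [random.gauss(0, 1.0) for _ in range(d_model)]  # stand-in for one token's activation
features, x_hat, error = replace_residual(x, W_enc, b_enc, W_dec, b_dec)
```

In this sketch the `error` term is what the features fail to explain, so `x_hat` plus `error` recovers `x` exactly. With trained weights, a small error and few active features would correspond to the faithful yet parsimonious description the abstract refers to.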