Decoding Vision Transformers: the Diffusion Steering Lens
This work targets researchers and practitioners who need to interpret Vision Transformers. It offers a finer-grained analysis of individual submodule contributions, although the advance is incremental because it builds on prior diffusion-based methods.
The paper addresses a limitation of existing interpretability methods for Vision Transformers (ViTs): they do not capture the direct contributions of individual submodules. It proposes Diffusion Steering Lens (DSL), a training-free approach that provides an intuitive and reliable interpretation of ViT internal processing, validated through interventional studies.
Logit Lens is a widely adopted method for mechanistic interpretability of transformer-based language models, enabling the analysis of how internal representations evolve across layers by projecting them into the output vocabulary space. Although applying Logit Lens to Vision Transformers (ViTs) is technically straightforward, its direct use faces limitations in capturing the richness of visual representations. Building on the work of Toker et al. (2024)~\cite{Toker2024-ve}, who introduced Diffusion Lens to visualize intermediate representations in the text encoders of text-to-image diffusion models, we demonstrate that while Diffusion Lens can effectively visualize residual stream representations in image encoders, it fails to capture the direct contributions of individual submodules. To overcome this limitation, we propose \textbf{Diffusion Steering Lens} (DSL), a novel, training-free approach that steers submodule outputs and patches subsequent indirect contributions. We validate our method through interventional studies, showing that DSL provides an intuitive and reliable interpretation of the internal processing in ViTs.
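The Logit Lens idea mentioned in the abstract, and the distinction it draws between the residual stream and an individual submodule's direct contribution, can be illustrated with a small toy sketch. This is not the DSL method and not the authors' code: the dimensions, the random "submodule outputs", the classifier `head`, and all function names are assumptions made up for illustration. The sketch treats the residual stream as a running sum of submodule outputs and decodes every intermediate state with the final normalization and output head.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def project(x, head):
    """Map a d-dimensional vector to logits using a head of shape (n_classes, d)."""
    return [sum(w * v for w, v in zip(row, x)) for row in head]


def residual_stream(x0, submodule_outputs):
    """Accumulate submodule outputs into successive residual stream states."""
    states = [list(x0)]
    for out in submodule_outputs:
        states.append([a + b for a, b in zip(states[-1], out)])
    return states


def logit_lens(states, head):
    """Decode every intermediate residual state with the final norm and head."""
    return [project(layer_norm(h), head) for h in states]


# Hypothetical toy setup: 8-dim stream, 3 output classes, 4 submodules.
random.seed(0)
d, n_classes, n_sub = 8, 3, 4
x0 = [random.gauss(0, 1) for _ in range(d)]
outs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_sub)]
head = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_classes)]

states = residual_stream(x0, outs)
per_layer_logits = logit_lens(states, head)
predictions = [max(range(n_classes), key=lambda c: l[c]) for l in per_layer_logits]

# Direct contributions: without the normalization step the head is linear,
# so the final logits split exactly into one term per submodule (plus the input).
direct = [project(v, head) for v in [x0] + outs]
summed_direct = [sum(col) for col in zip(*direct)]
final_unnormed = project(states[-1], head)
```

In this toy, `per_layer_logits` describes the accumulated residual stream at each depth, whereas each entry of `direct` isolates what a single submodule adds. The abstract's argument is about the same gap in the visual setting: a lens applied to the residual stream does not by itself reveal the contribution of one submodule.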