Causal Interpretation of Sparse Autoencoder Features in Vision
For vision AI researchers, this work addresses the risk of misinterpreting sparse autoencoder features and offers a more accurate method for explaining them. The contribution is incremental, since it builds on existing attribution techniques.
The paper tackles the problem of misinterpreting sparse autoencoder features in vision transformers. It shows that, because self-attention mixes information across patches, high-activation patches do not necessarily cause a feature to fire. The authors propose Causal Feature Explanation (CaFE), which uses input-attribution methods to identify the causal patches, and find that it yields more faithful and semantically precise explanations. Patch insertion tests confirm that it is more effective than naive activation maps.
Understanding what sparse autoencoder (SAE) features in vision transformers truly represent is usually done by inspecting the patches where a feature's activation is highest. However, self-attention mixes information across the entire image, so an activated patch often co-occurs with, but does not cause, the feature's firing. We propose Causal Feature Explanation (CaFE), which leverages the Effective Receptive Field (ERF). We treat each activation of an SAE feature as a target and apply input-attribution methods to identify the image patches that causally drive that activation. Across CLIP-ViT features, ERF maps frequently diverge from naive activation maps, revealing hidden context dependencies (e.g., a "roaring face" feature that requires the co-occurrence of eyes and nose, rather than merely an open mouth). Patch insertion tests confirm that CaFE more effectively recovers or suppresses feature activations than activation-ranked patches. Our results show that CaFE yields more faithful and semantically precise explanations of vision-SAE features, highlighting the risk of misinterpretation when relying solely on activation location.
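
The gap between where a feature activates and what causes it, and the way a patch insertion test exposes that gap, can be illustrated with a small toy. The sketch below is not the paper's code or model: the two-channel "patches", the `feature_activations` mixing rule, and all function names are hypothetical, and plain occlusion stands in for the input-attribution method, which the abstract does not specify.

```python
# Toy stand-in for a ViT + SAE feature (an assumption, not the paper's model).
# Each patch is a (mouth, eyes) pair. Token i's feature activation mixes in
# global context, loosely mimicking self-attention:
#     act[i] = mouth[i] * max_j eyes[j]

def feature_activations(patches):
    context = max(eyes for _, eyes in patches)
    return [mouth * context for mouth, _ in patches]

def occlusion_attribution(patches, target, baseline=(0.0, 0.0)):
    """Drop in the target activation when each patch is replaced by the baseline."""
    full = feature_activations(patches)[target]
    scores = []
    for i in range(len(patches)):
        occluded = list(patches)
        occluded[i] = baseline
        scores.append(full - feature_activations(occluded)[target])
    return scores

def insertion_curve(patches, target, ranking, baseline=(0.0, 0.0)):
    """Insert patches into a blank image in ranked order; record the target activation."""
    canvas = [baseline] * len(patches)
    curve = []
    for i in ranking:
        canvas[i] = patches[i]
        curve.append(feature_activations(canvas)[target])
    return curve

def rank(scores):
    """Patch indices sorted by descending score (ties keep original order)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# index 1: open mouth, index 2: mouth-like texture, index 3: eyes
patches = [(0.0, 0.0), (0.9, 0.0), (0.3, 0.0), (0.0, 1.0)]

acts = feature_activations(patches)
activation_ranking = rank(acts)      # naive: where the feature fires
target = activation_ranking[0]       # explain the strongest activation
causal_ranking = rank(occlusion_attribution(patches, target))

activation_curve = insertion_curve(patches, target, activation_ranking)
causal_curve = insertion_curve(patches, target, causal_ranking)
```

In this toy, the activation-ranked order inserts the mouth and the mouth-like texture first, and the target activation stays at zero until the eyes patch finally arrives. The attribution-ranked order inserts the mouth and the eyes first and recovers the activation after two patches. This matches the kind of hidden context dependency the abstract describes, although a real experiment would use the actual model and a proper attribution method.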