CV · Feb 10, 2025

Sparse Autoencoders for Scientifically Rigorous Interpretation of Vision Models

Microsoft
arXiv:2502.06755v1 · 20 citations · h-index: 42
Originality: Incremental advance
AI Analysis

The framework gives researchers and practitioners a tool for understanding and controlling vision model behavior, though the advance is incremental because it builds on existing sparse autoencoder methods.

The paper addresses the problem of interpreting and validating learned features in vision models. It introduces a unified framework based on sparse autoencoders (SAEs) that supports both the discovery of human-interpretable features and their precise manipulation for hypothesis testing, and it uses this framework to reveal differences in the semantic abstractions learned by models with different pre-training objectives.

To truly understand vision models, we must not only interpret their learned features but also validate these interpretations through controlled experiments. Current approaches either provide interpretable features without the ability to test their causal influence, or enable model editing without interpretable controls. We present a unified framework using sparse autoencoders (SAEs) that bridges this gap, allowing us to discover human-interpretable visual features and precisely manipulate them to test hypotheses about model behavior. By applying our method to state-of-the-art vision models, we reveal key differences in the semantic abstractions learned by models with different pre-training objectives. We then demonstrate the practical usage of our framework through controlled interventions across multiple vision tasks. We show that SAEs can reliably identify and manipulate interpretable visual features without model re-training, providing a powerful tool for understanding and controlling vision model behavior. We provide code, demos and models on our project website: https://osu-nlp-group.github.io/SAE-V.
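The abstract's core loop (encode an activation into sparse features, edit one feature, decode back) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain ReLU sparse autoencoder, uses random weights in place of a trained SAE, and all names and sizes (`encode`, `decode`, `intervene`, `d_model`, `d_sae`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # hypothetical activation width and SAE dictionary size

# Random weights stand in for a trained SAE (assumption, for illustration only).
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)


def encode(x):
    """Map a model activation to non-negative feature activations (ReLU)."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)


def decode(f):
    """Reconstruct a model activation from feature activations."""
    return f @ W_dec + b_dec


def intervene(x, feature, value):
    """Set one feature to `value` and return the edited activation.

    The SAE's reconstruction error is added back so that the only change
    to the activation comes from the edited feature.
    """
    f = encode(x)
    error = x - decode(f)
    f_edit = f.copy()
    f_edit[feature] = value
    return decode(f_edit) + error


x = rng.normal(size=d_model)  # stand-in for one patch activation from a vision model
x_edit = intervene(x, feature=3, value=5.0)
```

In this sketch the edited activation differs from the original only along the decoder direction of the chosen feature, which is what makes such an edit usable as a controlled experiment: the edited activation would be passed on through the rest of the frozen model and the change in output observed.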

