CVLGMay 21, 2025

Analyzing Hierarchical Structure in Vision Models with Sparse Autoencoders

arXiv:2505.15970v13 citationsh-index: 5
Originality Synthesis-oriented
AI Analysis

This work provides a framework for systematically analyzing hierarchical representations in vision models, which is incremental as it extends SAEs from language to vision domains.

The study tackled the problem of analyzing how vision models encode the hierarchical structure of the ImageNet taxonomy, using Sparse Autoencoders (SAEs) to probe internal representations, and found that SAEs uncover implicit hierarchical relationships in model activations, revealing consistency across layers in DINOv2.

The ImageNet hierarchy provides a structured taxonomy of object categories, offering a valuable lens through which to analyze the representations learned by deep vision models. In this work, we conduct a comprehensive analysis of how vision models encode the ImageNet hierarchy, leveraging Sparse Autoencoders (SAEs) to probe their internal representations. SAEs have been widely used as an explanation tool for large language models (LLMs), where they enable the discovery of semantically meaningful features. Here, we extend their use to vision models to investigate whether learned representations align with the ontological structure defined by the ImageNet taxonomy. Our results show that SAEs uncover hierarchical relationships in model activations, revealing an implicit encoding of taxonomic structure. We analyze the consistency of these representations across different layers of the popular vision foundation model DINOv2 and provide insights into how deep vision models internalize hierarchical category information by increasing information in the class token through each layer. Our study establishes a framework for systematic hierarchical analysis of vision model representations and highlights the potential of SAEs as a tool for probing semantic structure in deep networks.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes