LG | Sep 26, 2025

Concept-SAE: Active Causal Probing of Visual Model Behavior

arXiv:2509.22015v1
h-index: 8
Originality: Highly original
AI Analysis

This work addresses the need for reliable, causal interpretation tools in machine learning, particularly for visual models, offering a validated method to move beyond correlational analysis, though it is incremental in building upon existing SAE techniques.

The paper tackles the ambiguity of the features that Sparse Autoencoders (SAEs) learn when they are used to interpret visual models. It introduces Concept-SAE, a framework that creates semantically grounded concept tokens through a hybrid disentanglement strategy. According to the authors, these tokens outperform alternative methods in fidelity and spatial localization, which enables causal probing of model behavior and layer-level analysis of adversarial vulnerabilities.

Standard Sparse Autoencoders (SAEs) excel at discovering a dictionary of a model's learned features, offering a powerful observational lens. However, the ambiguous and ungrounded nature of these features makes them unreliable instruments for the active, causal probing of model behavior. To solve this, we introduce Concept-SAE, a framework that forges semantically grounded concept tokens through a novel hybrid disentanglement strategy. We first quantitatively demonstrate that our dual-supervision approach produces tokens that are remarkably faithful and spatially localized, outperforming alternative methods in disentanglement. This validated fidelity enables two critical applications: (1) we probe the causal link between internal concepts and predictions via direct intervention, and (2) we probe the model's failure modes by systematically localizing adversarial vulnerabilities to specific layers. Concept-SAE provides a validated blueprint for moving beyond correlational interpretation to the mechanistic, causal probing of model behavior.
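
The idea of probing "via direct intervention" can be illustrated with a generic sparse autoencoder. The following is a minimal sketch, not the Concept-SAE method itself: it assumes a plain ReLU encoder, a linear decoder, and a linear prediction head, and every name and weight in it (`sae_encode`, `causal_effect`, `W_head`, the toy matrices) is hypothetical. It ablates one latent unit, decodes the edited code back into activation space, and reports how the downstream logits change.

```python
# Generic SAE intervention sketch (assumed setup, not the paper's architecture).

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sae_encode(x, W_enc, b_enc):
    """Sparse code: ReLU(W_enc @ x + b_enc)."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    return [max(0.0, p) for p in pre]

def sae_decode(z, W_dec):
    """Reconstruct the activation from the sparse code."""
    return matvec(W_dec, z)

def intervene(z, index, value=0.0):
    """Return a copy of z with one latent unit clamped to `value`."""
    z_new = list(z)
    z_new[index] = value
    return z_new

def causal_effect(x, index, W_enc, b_enc, W_dec, W_head):
    """Change in downstream logits when latent `index` is ablated."""
    z = sae_encode(x, W_enc, b_enc)
    base = matvec(W_head, sae_decode(z, W_dec))
    ablated = matvec(W_head, sae_decode(intervene(z, index), W_dec))
    return [a - b for a, b in zip(ablated, base)]

# Toy, hand-picked weights (hypothetical): 2-d activation, 3 latents, 2 classes.
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -1.0]
W_dec = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
W_head = [[1.0, -1.0], [-1.0, 1.0]]

effect = causal_effect([2.0, 1.0], 0, W_enc, b_enc, W_dec, W_head)
print(effect)  # shift in the two class logits after ablating latent 0
```

A nonzero shift indicates that, in this toy model, the ablated latent causally influences the prediction. A latent that is inactive for the given input produces no shift, which is why interventions of this kind are only informative when the latent is reliably tied to a concept.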

