LG, Feb 15

Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?

arXiv:2602.14111v13 citations
Originality: Incremental advance
AI Analysis

This work challenges the reliability of SAEs as a tool for interpreting neural networks. Its results suggest that current SAEs may not reliably decompose a model's internal mechanisms, which matters for AI interpretability researchers and practitioners who rely on them.

The paper investigates whether Sparse Autoencoders (SAEs) recover meaningful features from neural network activations, finding that they recover only 9% of true features in synthetic tests and perform similarly to random baselines in real-world evaluations, suggesting they fail at their core decomposition task.

Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a growing number of negative results in downstream tasks casts doubt on whether SAEs recover meaningful features. To directly investigate this, we perform two complementary evaluations. On a synthetic setup with known ground-truth features, we demonstrate that SAEs recover only $9\%$ of true features despite achieving $71\%$ explained variance, showing that they fail at their core task even when reconstruction is strong. To evaluate SAEs on real activations, we introduce three baselines that constrain SAE feature directions or their activation patterns to random values. Through extensive experiments across multiple SAE architectures, we show that our baselines match fully-trained SAEs in interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0.72), and causal editing (0.73 vs 0.72). Together, these results suggest that SAEs in their current state do not reliably decompose models' internal mechanisms.
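The two quantities contrasted in the synthetic evaluation, feature recovery and explained variance, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the abstract does not state how a feature counts as "recovered", so the cosine-similarity matching rule, the `threshold=0.9` cutoff, the dimensions, and the helper names `unit_rows`, `recovery_rate`, and `explained_variance` are all assumptions made for illustration.

```python
import numpy as np


def unit_rows(m):
    """Normalise each row of m to unit L2 norm."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)


def recovery_rate(true_dirs, learned_dirs, threshold=0.9):
    """Fraction of ground-truth directions matched by some learned direction.

    Assumption: a true feature counts as recovered when its best (signed)
    cosine similarity with any learned direction is at least `threshold`.
    """
    sims = unit_rows(true_dirs) @ unit_rows(learned_dirs).T
    return float((sims.max(axis=1) >= threshold).mean())


def explained_variance(x, x_hat):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return float(1.0 - resid / total)


rng = np.random.default_rng(0)
d, n_true = 64, 32  # hypothetical activation width and number of true features
true_dirs = unit_rows(rng.standard_normal((n_true, d)))

# Toy dictionary that contains 8 of the 32 true directions plus 24 random ones.
partial_dirs = np.vstack([true_dirs[:8], unit_rows(rng.standard_normal((24, d)))])

# Random-direction dictionary, in the spirit of a random baseline.
random_dirs = unit_rows(rng.standard_normal((n_true, d)))

partial_score = recovery_rate(true_dirs, partial_dirs)
random_score = recovery_rate(true_dirs, random_dirs)
print(partial_score, random_score)
```

In this toy setup the recovery score depends only on how well the dictionary directions align with the ground truth, while `explained_variance` depends only on reconstruction error. The two can therefore diverge, which is the kind of gap the abstract reports (9% recovery at 71% explained variance).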

