LG · May 29

On the Relationship Between Activation Outliers and Feature Death in Sparse Autoencoders

arXiv:2605.31518 · 88.9
Predicted impact: top 9% in LG · last 90 days · Originality: highly original
AI Analysis

This work gives a principled explanation for feature death in sparse autoencoders, a problem that hinders interpretability and efficiency for researchers working with these models, and offers a simple fix: mean-centering the activations.

This paper investigates feature death in sparse autoencoders (SAEs), where many learned features never activate, wasting dictionary capacity. The authors find that activation outliers, quantified by $\gamma = \|\mu\|/\|\sigma\|$, cause the problem by shifting pre-activations at initialization, so that features anti-aligned with the activation mean receive permanently negative pre-activations. Mean-centering the activations eliminates outlier-induced feature death across all tested models.

Sparse autoencoders (SAEs) decompose neural network activations into interpretable features, but many learned features never activate, a problem called feature death that wastes dictionary capacity and can reintroduce superposition. Death rates vary dramatically between models: near-zero on GPT-2, over 70% on AlphaFold3 with identical configurations. We find that dimension-level activation outliers (dimensions whose mean magnitude is large relative to per-token variation) cause this by shifting pre-activations at initialization based on each feature's alignment with the activation mean. Features anti-aligned with the mean receive permanently negative pre-activations and never fire. We formalize outlier severity as $\gamma = \|\mu\|/\|\sigma\|$; it predicts initial death rates (Spearman $\rho = 0.89$ for dead-by-TopK, $0.82$ for dead-by-ReLU) across 454 model-layer combinations spanning language, vision, protein, and genomic models. Dead features can revive during training, but recovery requires the SAE bias to learn the activation mean, a process that is prohibitively slow at high $\gamma$. Mean-centering (subtracting the activation mean) sidesteps this and eliminates outlier-induced death across all tested models, confirming the mechanism and providing a principled basis for when and why this preprocessing step is necessary.
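
The mechanism can be illustrated with a small synthetic sketch. It is not the paper's code or experimental setup. It assumes a linear encoder with pre-activation $w_i^\top x + b_i$, which can be rewritten as $w_i^\top \mu + w_i^\top (x - \mu) + b_i$: a fixed shift set by the feature's alignment with the mean, plus per-token variation. The data (Gaussian noise with one large-mean dimension), the random unit-norm initialization with zero bias, and the names `outlier_severity` and `dead_by_relu_mask` are all hypothetical choices made for illustration. Only the dead-by-ReLU criterion is shown.

```python
import numpy as np

def outlier_severity(acts):
    """gamma = ||mu|| / ||sigma||, with mu and sigma the per-dimension
    mean and standard deviation taken over tokens."""
    mu = acts.mean(axis=0)
    sigma = acts.std(axis=0)
    return float(np.linalg.norm(mu) / np.linalg.norm(sigma))

def dead_by_relu_mask(acts, W_enc, b_enc):
    """True for features whose pre-activation is <= 0 on every token,
    i.e. features a ReLU encoder would never fire on this batch."""
    pre = acts @ W_enc + b_enc          # shape (n_tokens, n_features)
    return (pre <= 0).all(axis=0)

rng = np.random.default_rng(0)
n_tokens, d_model, n_features = 2048, 64, 512

# Synthetic activations (assumption): unit-variance noise plus one
# outlier dimension with a large mean.
acts = rng.standard_normal((n_tokens, d_model))
acts[:, 0] += 50.0

# Hypothetical initialization: random unit-norm encoder columns, zero bias.
W_enc = rng.standard_normal((d_model, n_features))
W_enc /= np.linalg.norm(W_enc, axis=0, keepdims=True)
b_enc = np.zeros(n_features)

# Raw activations: high gamma, and a sizeable share of features is dead.
mu = acts.mean(axis=0)
alignment = mu @ W_enc                  # w_i^T mu for each feature
gamma_raw = outlier_severity(acts)
dead_mask_raw = dead_by_relu_mask(acts, W_enc, b_enc)
dead_raw = float(dead_mask_raw.mean())

# Mean-centering: subtract the activation mean, then re-measure.
centered = acts - mu
gamma_centered = outlier_severity(centered)
dead_centered = float(dead_by_relu_mask(centered, W_enc, b_enc).mean())

print(f"raw:      gamma={gamma_raw:.2f}, dead fraction={dead_raw:.2f}")
print(f"centered: gamma={gamma_centered:.2e}, dead fraction={dead_centered:.2f}")
```

In this toy setting every dead feature has negative `alignment`, and centering removes the shift term so no feature is dead at initialization. Trained SAEs and real model activations will behave less cleanly than this sketch.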
