Disentangling Polysemantic Neurons with a Null-Calibrated Polysemanticity Index and Causal Patch Interventions
This work targets researchers in mechanistic interpretability who need to interpret complex neural networks. Its contribution is incremental, since it builds on existing metrics and methods.
The authors address polysemantic neurons by introducing the Polysemanticity Index (PSI), a null-calibrated metric that quantifies when a neuron's top activations decompose into distinct semantic clusters. They validate PSI on a pretrained ResNet-50 with Tiny-ImageNet images. Later layers show substantially higher PSI than earlier layers, and causal patch interventions increase target-neuron activation significantly more than controls.
Neural networks often contain polysemantic neurons that respond to multiple, sometimes unrelated, features, complicating mechanistic interpretability. We introduce the Polysemanticity Index (PSI), a null-calibrated metric that quantifies when a neuron's top activations decompose into semantically distinct clusters. PSI multiplies three independently calibrated components: geometric cluster quality (S), alignment to labeled categories (Q), and open-vocabulary semantic distinctness via CLIP (D). On a pretrained ResNet-50 evaluated with Tiny-ImageNet images, PSI identifies neurons whose activation sets split into coherent, nameable prototypes, and reveals strong depth trends: later layers exhibit substantially higher PSI than earlier layers. We validate our approach with robustness checks (varying hyperparameters, random seeds, and cross-encoder text heads), breadth analyses (comparing class-only vs. open-vocabulary concepts), and causal patch-swap interventions. In particular, aligned patch replacements increase target-neuron activation significantly more than non-aligned, random, shuffled-position, or ablate-elsewhere controls. PSI thus offers a principled and practical lever for discovering, quantifying, and studying polysemantic units in neural networks.
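To make the multiplicative structure of PSI concrete, the following is a minimal sketch in Python, not the authors' implementation. The abstract does not specify how each component is calibrated against its null, so the rule used here is an assumption: each raw score is rescaled so that the mean of its null distribution maps to 0 and an assumed upper bound of 1 maps to 1, then clipped to [0, 1]. The function names `null_calibrate` and `psi`, and all numbers, are hypothetical.

```python
from statistics import mean


def null_calibrate(observed: float, null_scores: list[float], upper: float = 1.0) -> float:
    """Rescale a raw score against its null distribution.

    ASSUMPTION: the null mean maps to 0 and `upper` maps to 1, with
    clipping to [0, 1]. The paper's actual calibration rule may differ.
    """
    baseline = mean(null_scores)
    if upper <= baseline:
        return 0.0  # no headroom above the null, so nothing can register
    scaled = (observed - baseline) / (upper - baseline)
    return min(1.0, max(0.0, scaled))


def psi(raw: dict[str, float], nulls: dict[str, list[float]]) -> float:
    """Product of the calibrated S, Q, and D components."""
    result = 1.0
    for name in ("S", "Q", "D"):
        result *= null_calibrate(raw[name], nulls[name])
    return result


# Hypothetical null samples for each component (e.g. from shuffled data).
nulls = {
    "S": [0.1, 0.2, 0.3],  # geometric cluster quality under the null
    "Q": [0.4, 0.5, 0.6],  # label alignment under the null
    "D": [0.5, 0.6, 0.7],  # CLIP semantic distinctness under the null
}

# Neuron A clears all three nulls; neuron B fails only on D.
neuron_a = {"S": 0.6, "Q": 0.8, "D": 0.8}
neuron_b = {"S": 0.6, "Q": 0.8, "D": 0.5}

psi_a = psi(neuron_a, nulls)
psi_b = psi(neuron_b, nulls)
print(f"PSI(A) = {psi_a:.3f}, PSI(B) = {psi_b:.3f}")
```

Under this assumed calibration, a neuron scores above zero only if all three components exceed their null baselines. Neuron B has well-separated clusters that align with labels, yet its PSI is zero because its semantic distinctness does not beat chance. That is the practical consequence of multiplying the components rather than averaging them.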