Toy Models of Superposition
This work is aimed at researchers and practitioners working on neural network interpretability. It studies polysemanticity in a toy setting where the phenomenon can be fully understood.
The paper tackles polysemanticity, in which many unrelated concepts are packed into a single neuron. It presents a toy model in which the phenomenon can be fully understood as the result of models storing additional sparse features in superposition. It also demonstrates a phase change, a connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples.
Neural networks often pack many unrelated concepts into a single neuron, a puzzling phenomenon known as "polysemanticity," which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.
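
To make the idea of storing sparse features in superposition concrete, the following is a minimal sketch rather than the paper's actual experimental setup. It assumes a tied-weight reconstruction of the form ReLU(WᵀWx + b), and it uses hand-picked weights instead of trained ones: five feature directions placed at the vertices of a regular pentagon in a two-dimensional hidden space. The names `reconstruct` and `one_hot` and the bias value `BIAS = -0.35` are hypothetical choices made for illustration only.

```python
import math

N_FEATURES = 5   # number of features to store
N_HIDDEN = 2     # hidden dimensions available (fewer than features)
BIAS = -0.35     # hypothetical negative bias that filters out small interference

# Assumption: weights are hand-picked, not trained. Column i of W is the
# hidden-space direction for feature i, a vertex of a regular pentagon.
W = [
    [math.cos(2 * math.pi * i / N_FEATURES) for i in range(N_FEATURES)],
    [math.sin(2 * math.pi * i / N_FEATURES) for i in range(N_FEATURES)],
]


def reconstruct(x):
    """Compute ReLU(W^T W x + b) for a feature vector x of length N_FEATURES."""
    # Compress the five features into two hidden dimensions: h = W x.
    h = [sum(W[d][i] * x[i] for i in range(N_FEATURES)) for d in range(N_HIDDEN)]
    # Decompress with the same (tied) weights, add the bias, apply ReLU.
    return [
        max(0.0, sum(W[d][j] * h[d] for d in range(N_HIDDEN)) + BIAS)
        for j in range(N_FEATURES)
    ]


def one_hot(*active):
    """Return a feature vector with 1.0 at the given indices and 0.0 elsewhere."""
    return [1.0 if i in active else 0.0 for i in range(N_FEATURES)]


if __name__ == "__main__":
    # Sparse inputs (one active feature) versus a denser input (two active).
    for active in [(0,), (3,), (0, 2)]:
        out = reconstruct(one_hot(*active))
        print(active, [round(v, 3) for v in out])
```

In this sketch, when exactly one feature is active, only that feature's output is nonzero, so five features are recoverable from two dimensions. When features 0 and 2 are active together, their interference instead activates feature 1. This illustrates why sparsity matters for superposition: the scheme works only as long as features rarely co-occur.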