LG · Sep 21, 2022

Toy Models of Superposition

OpenAI
arXiv:2209.10652v1 · 846 citations · h-index: 68
Originality: Highly original
AI Analysis

This work addresses interpretability challenges in neural networks for researchers and practitioners, offering foundational insights into polysemanticity.

The paper tackles polysemanticity, where a single neuron responds to several unrelated concepts. It presents a toy model in which the phenomenon can be fully understood as the result of the model storing additional sparse features in superposition. Within that model, the authors demonstrate a phase change, a connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples.

Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.
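
As a rough illustration of what storing features in "superposition" can mean, the sketch below packs two features into one hidden dimension and reads them back through a ReLU. It is a minimal sketch under stated assumptions, not the paper's implementation. The abstract does not spell out the model's form, so the linear compression followed by a ReLU readout, the hand-picked antipodal weights `W = [1, -1]`, and the helper names `relu` and `reconstruct` are all hypothetical choices made here for clarity.

```python
# Hypothetical illustration: two features share a single hidden dimension.
# The weights below are hand-picked assumptions, not trained values.

def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]

def reconstruct(x, W, b):
    """Compress x to one hidden value h = W . x, then decode as ReLU(W * h + b)."""
    h = sum(w * xi for w, xi in zip(W, x))
    return relu([w * h + bi for w, bi in zip(W, b)])

W = [1.0, -1.0]  # feature 0 points along +1, feature 1 along -1
b = [0.0, 0.0]

sparse_a = reconstruct([1.0, 0.0], W, b)  # only feature 0 active
sparse_b = reconstruct([0.0, 1.0], W, b)  # only feature 1 active
dense = reconstruct([1.0, 1.0], W, b)     # both active: they cancel in h

print(sparse_a, sparse_b, dense)
```

When only one feature is active, the ReLU removes the negative interference and the input is recovered exactly. When both are active at once, they cancel in the shared hidden dimension and the reconstruction fails. This is why the scheme only pays off when features are sparse, and why the single hidden unit responds to two unrelated features.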

Code Implementations: 1 repo
