LG | Sep 21, 2022

Toy Models of Superposition

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish

Anthropic

arXiv:2209.10652v1 | 55.2957 citations | h-index: 68 | Has Code

Originality: Highly original

AI Analysis

This work gives interpretability researchers and practitioners a foundational account of polysemanticity in neural networks.

The paper tackles polysemanticity, where a single neuron responds to several unrelated concepts. It presents a toy model in which the phenomenon can be fully understood as the result of storing sparse features in superposition. It also demonstrates a phase change, a connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples.

Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.
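As a rough illustration of what "storing sparse features in superposition" can mean, the sketch below packs five features into two hidden dimensions and reads them back with a ReLU. It is not the paper's code. The abstract does not spell out the architecture, so the form ReLU(WᵀWx + b), the hand-picked pentagon weights (a nod to the polytope connection), the bias value, and the names `pentagon_weights` and `reconstruct` are all assumptions chosen for illustration. No training is performed.

```python
import numpy as np


def pentagon_weights(n_features: int = 5) -> np.ndarray:
    """Return a 2 x n matrix whose columns are unit vectors at the
    vertices of a regular polygon (hypothetical hand-set weights)."""
    angles = 2 * np.pi * np.arange(n_features) / n_features
    return np.stack([np.cos(angles), np.sin(angles)])  # shape (2, n)


def reconstruct(W: np.ndarray, b: float, x: np.ndarray) -> np.ndarray:
    """Compress x to the hidden space, then read it back out.

    Assumed toy form: x_hat = ReLU(W^T (W x) + b).
    """
    hidden = W @ x                 # n features squeezed into 2 dimensions
    return np.maximum(W.T @ hidden + b, 0.0)


if __name__ == "__main__":
    W = pentagon_weights()
    # A negative bias equal to the largest positive overlap between
    # neighbouring directions, so that interference is clipped by the ReLU.
    b = -np.cos(2 * np.pi / 5)

    sparse_x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # one active feature
    dense_x = np.array([1.0, 0.0, 1.0, 0.0, 0.0])    # two active features

    # Sparse input: only feature 0 survives (scaled down by the bias).
    print("sparse:", np.round(reconstruct(W, b, sparse_x), 3))
    # Dense input: interference activates feature 1, which was not active.
    print("dense: ", np.round(reconstruct(W, b, dense_x), 3))
```

With a single active feature, the readout is nonzero only for that feature, so five features coexist in two dimensions without confusion. With two features active at once, the overlapping directions interfere and the readout lights up a feature that was never active, which is why this kind of packing relies on the features being sparse.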
