Mathematical Models of Computation in Superposition
This work addresses a problem in neural network interpretability by showing how superposition can enable efficient computation, not just compact representation. The contribution is incremental in that it builds on the existing theory of representational superposition.
The paper tackles the challenge of mechanistically interpreting AI systems by developing mathematical models of computation in superposition, where superposition actively helps accomplish the task efficiently. It shows that a 1-layer MLP with only $\tilde{O}(m^{2/3})$ neurons can compute the AND of all $\binom{m}{2}$ pairs of $m$ features up to $\varepsilon$-error, and it generalizes this construction to sparse boolean circuits of low depth and to deep networks.
Superposition -- when a neural network represents more ``features'' than it has dimensions -- seems to pose a serious challenge to mechanistically interpreting current AI systems. Existing theory work studies \emph{representational} superposition, where superposition is only used when passing information through bottlenecks. In this work, we present mathematical models of \emph{computation} in superposition, where superposition is actively helpful for efficiently accomplishing the task. We first construct a task of efficiently emulating a circuit that takes the AND of the $\binom{m}{2}$ pairs of each of $m$ features. We construct a 1-layer MLP that uses superposition to perform this task up to $\varepsilon$-error, where the network only requires $\tilde{O}(m^{\frac{2}{3}})$ neurons, even when the input features are \emph{themselves in superposition}. We generalize this construction to arbitrary sparse boolean circuits of low depth, and then construct ``error correction'' layers that allow deep fully-connected networks of width $d$ to emulate circuits of width $\tilde{O}(d^{1.5})$ and \emph{any} polynomial depth. We conclude by providing some potential applications of our work for interpreting neural networks that implement computation in superposition.
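To make the pairwise-AND task concrete, the following is a minimal sketch and not the paper's exact construction. It assumes the features arrive as an explicit sparse boolean vector (not themselves in superposition), that each hidden neuron reads a random subset of features, and that each AND is read off by averaging the neurons connected to both inputs. The helper names (`make_random_masks`, `hidden_activations`, `read_and`) and the sizes `m`, `d`, `p` are hypothetical choices made for illustration only.

```python
import random
from itertools import combinations


def relu(z):
    return max(0.0, z)


def make_random_masks(m, d, p, seed=0):
    """Hypothetical helper: each of d neurons reads each of m features with probability p."""
    rng = random.Random(seed)
    return [{i for i in range(m) if rng.random() < p} for _ in range(d)]


def hidden_activations(x, masks):
    """Neuron k computes ReLU(sum of the features it reads - 1)."""
    return [relu(sum(x[i] for i in mask) - 1.0) for mask in masks]


def read_and(h, masks, i, j):
    """Estimate x_i AND x_j by averaging the neurons connected to both i and j."""
    shared = [h[k] for k, mask in enumerate(masks) if i in mask and j in mask]
    return sum(shared) / len(shared) if shared else 0.0


# Illustrative sizes (assumed): 400 neurons for C(40, 2) = 780 pairs.
m, d, p = 40, 400, 0.15
masks = make_random_masks(m, d, p)

# Sparse input with exactly two active features.
x = [0.0] * m
x[3] = x[17] = 1.0

h = hidden_activations(x, masks)
estimates = {(i, j): read_and(h, masks, i, j) for i, j in combinations(range(m), 2)}
```

In this sketch, every neuron that reads both active features outputs exactly 1, so the estimate for the active pair is 1 whenever at least one such neuron exists. For any other pair, the estimate is the fraction of its shared neurons that happen to read both active features as well. That interference term is the kind of error an $\varepsilon$-error guarantee has to control, and it grows as the input becomes less sparse.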