CL · AI · Jan 30

Language Model Circuits Are Sparse in the Neuron Basis

Stanford
arXiv:2601.22594v1 · 8 citations · h-index: 14
Originality: Incremental advance
AI Analysis

This work advances automated interpretability for language models without extra training costs, though the advance is incremental because it builds on existing neuron-based methods.

The paper addresses language model interpretability by showing that MLP neurons are as sparse a feature basis as sparse autoencoders (SAEs). This enables a circuit-tracing pipeline that works directly on the neuron basis. It identifies a causal circuit of about 100 neurons for subject-verb agreement, and small sets of neurons that encode intermediate steps in multi-hop reasoning.

The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques such as \textit{sparse autoencoders} (SAEs) to decompose the neuron basis into more interpretable units of model computation, for tasks such as \textit{circuit tracing}. However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that \textbf{MLP neurons are as sparse a feature basis as SAEs}. We use this finding to develop an end-to-end pipeline for circuit tracing on the MLP neuron basis, which locates causal circuitry on a variety of tasks using gradient-based attribution. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of $\approx 10^2$ MLP neurons is enough to control model behaviour. On the multi-hop city $\to$ state $\to$ capital task from Lindsey et al., 2025, we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g.~`map city to its state'), and can be steered to change the model's output. This work thus advances automated interpretability of language models without additional training costs.
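
The abstract names gradient-based attribution on the MLP neuron basis but does not say which variant the pipeline uses. As a minimal sketch, the example below assumes gradient × activation scoring on a toy one-hidden-layer ReLU MLP with random weights, and uses a logit difference as the behaviour metric. Every name here (`neuron_attributions`, `top_k_circuit`, `ablate`, the layer sizes) is hypothetical and chosen for illustration; none of it is taken from the paper's code.

```python
import random


def relu(x):
    return x if x > 0.0 else 0.0


def hidden_activations(w_in, x):
    """Post-ReLU activation of each hidden (MLP) neuron for input x."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]


def logit_diff(w_out, acts, correct, wrong):
    """Behaviour metric: logit of the correct class minus the wrong class."""
    logits = [sum(w * a for w, a in zip(row, acts)) for row in w_out]
    return logits[correct] - logits[wrong]


def neuron_attributions(w_out, acts, correct, wrong):
    """Gradient x activation score for each neuron (assumed variant).

    In this toy model the metric is linear in the activations, so the
    gradient d(metric)/d(a_i) is w_out[correct][i] - w_out[wrong][i].
    A real transformer would obtain this gradient by backpropagation.
    """
    return [a * (w_out[correct][i] - w_out[wrong][i])
            for i, a in enumerate(acts)]


def top_k_circuit(scores, k):
    """Indices of the k neurons with the largest absolute attribution."""
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]),
                   reverse=True)
    return order[:k]


def ablate(acts, neurons):
    """Zero out the chosen neurons, leaving the rest unchanged."""
    drop = set(neurons)
    return [0.0 if i in drop else a for i, a in enumerate(acts)]


# Toy setup: random weights stand in for a trained model (assumption).
rng = random.Random(0)
n_in, n_hidden, n_out = 8, 64, 2
w_in = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)]
w_out = [[rng.gauss(0, 1) for _ in range(n_hidden)] for _ in range(n_out)]
x = [rng.gauss(0, 1) for _ in range(n_in)]

acts = hidden_activations(w_in, x)
scores = neuron_attributions(w_out, acts, correct=0, wrong=1)
circuit = top_k_circuit(scores, k=10)

base = logit_diff(w_out, acts, 0, 1)
ablated = logit_diff(w_out, ablate(acts, circuit), 0, 1)
print(f"circuit neurons: {circuit}")
print(f"logit diff before: {base:.3f}, after ablating circuit: {ablated:.3f}")
```

Because the toy metric is linear in the activations, each score equals the exact change in the metric when that neuron is zeroed. In a full language model the same score is only a first-order estimate, so a candidate circuit would still need to be checked with actual interventions such as ablation or steering.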

