Towards Spectroscopy: Susceptibility Clusters in Language Models
This work proposes a method for understanding language model internals. The contribution is incremental in that it builds on existing perturbation and clustering techniques.
The authors address the problem of interpreting internal structure in language models by applying principles from spectroscopy: they measure the model's response to token perturbations via susceptibilities. They identify 510 interpretable clusters in Pythia-14M, 50% of which match features from sparse autoencoders.
Spectroscopy infers the internal structure of physical systems by measuring their response to perturbations. We apply this principle to neural networks: perturbing the data distribution by upweighting a token $y$ in context $x$, we measure the model's response via susceptibilities $\chi_{xy}$. These are covariances between component-level observables and the perturbation, computed over a localized Gibbs posterior via stochastic gradient Langevin dynamics (SGLD). Theoretically, we show that susceptibilities decompose as a sum over modes of the data distribution, explaining why tokens that follow their contexts "for similar reasons" cluster together in susceptibility space. Empirically, we apply this methodology to Pythia-14M, developing a conductance-based clustering algorithm that identifies 510 interpretable clusters ranging from grammatical patterns to code structure to mathematical notation. In a comparison with sparse autoencoders (SAEs), 50% of our clusters match SAE features, validating that both methods recover similar structure.
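To make the two central quantities concrete, the following is a minimal sketch, not the authors' implementation. It assumes that SGLD has already produced a set of posterior samples, and that for each sample we have recorded one observable value per model component together with the value of the perturbation for a fixed pair $(x, y)$. The function names, array shapes, and the toy graph are hypothetical. The sign and scaling conventions of the susceptibility are not given in the text above, so the plain sample covariance is shown. Likewise, the clustering algorithm itself is not specified here; only the standard graph conductance score that such an algorithm would evaluate is illustrated.

```python
import numpy as np


def susceptibility_estimate(observable_samples, perturbation_samples):
    """Sample covariance between each component observable and the perturbation.

    observable_samples: shape (n_samples, n_components), one row per posterior sample.
    perturbation_samples: shape (n_samples,), perturbation value per posterior sample.
    Returns an array of shape (n_components,).
    Assumption: any sign or scale factor used in the paper is omitted.
    """
    phi = np.asarray(observable_samples, dtype=float)
    delta = np.asarray(perturbation_samples, dtype=float)
    phi_centered = phi - phi.mean(axis=0)
    delta_centered = delta - delta.mean()
    # Unbiased estimator: divide by n_samples - 1.
    return phi_centered.T @ delta_centered / (len(delta) - 1)


def conductance(adjacency, members):
    """Conductance of a node set in a weighted undirected graph.

    Defined as cut(S, complement) / min(vol(S), vol(complement)), where the
    volume of a set is the sum of the weighted degrees of its nodes.
    """
    A = np.asarray(adjacency, dtype=float)
    mask = np.zeros(A.shape[0], dtype=bool)
    mask[np.asarray(list(members), dtype=int)] = True
    cut = A[mask][:, ~mask].sum()
    vol_in = A[mask].sum()
    vol_out = A[~mask].sum()
    denom = min(vol_in, vol_out)
    # Degenerate sets (empty, full, or isolated) get the worst score.
    return cut / denom if denom > 0 else 1.0


# Synthetic stand-in for SGLD samples: component 0 responds to the
# perturbation, component 1 is independent noise.
rng = np.random.default_rng(0)
delta = rng.normal(size=2000)
phi = np.stack([2.0 * delta + rng.normal(size=2000), rng.normal(size=2000)], axis=1)
chi = susceptibility_estimate(phi, delta)

# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
score = conductance(A, [0, 1, 2])
```

In this toy setting, the first entry of `chi` is large and the second is close to zero, and the triangle `{0, 1, 2}` has a low conductance because only one edge leaves it. A conductance-based clustering procedure would favor node sets with low scores of this kind.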