LG · Apr 25, 2025

Structural Inference: Interpreting Small Language Models with Susceptibilities

Garrett Baker, George Wang, Jesse Hoogland, Daniel Murfet

arXiv:2504.18274v2 · 19.79 citations · h-index: 7

Originality: Incremental advance

AI Analysis

This work addresses interpretability for researchers and practitioners who use small language models. It offers a novel method, but its impact is rated incremental because it focuses on specific model components.

The authors interpret small language models with a linear response framework that treats neural networks as Bayesian statistical mechanical systems. The framework yields susceptibility-based attribution scores, and a response matrix built from the susceptibilities separates functional modules, such as multigram and induction heads, in a 3M-parameter transformer.

We develop a linear response framework for interpretability that treats a neural network as a Bayesian statistical mechanical system. A small perturbation of the data distribution, for example shifting the Pile toward GitHub or legal text, induces a first-order change in the posterior expectation of an observable localized on a chosen component of the network. The resulting susceptibility can be estimated efficiently with local SGLD samples and factorizes into signed, per-token contributions that serve as attribution scores. We combine these susceptibilities into a response matrix whose low-rank structure separates functional modules such as multigram and induction heads in a 3M-parameter transformer.
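The pipeline in the abstract can be illustrated with a minimal sketch on a toy problem. Everything here is an assumption made for illustration and is not taken from the paper's code: a linear regression stands in for the transformer, the posterior is assumed to be tempered as exp(-n * beta * L(w)) with a quadratic localization term around w_star, the observable for a component is the loss with all other weights frozen at w_star, and a distribution shift is modeled as reweighting toward a data subset. Under those assumptions the first-order response is -n * beta * Cov(phi, delta_L), and the covariance splits linearly into per-sample terms, which play the role of the per-token contributions. The names `local_sgld`, `observable`, and `per_token_susceptibility` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a language model: linear regression with squared loss.
n, d = 256, 4
X = rng.normal(size=(n, d))
w_star = np.array([1.0, -0.5, 0.3, 2.0])        # assumed trained weights
y = X @ w_star + 0.1 * rng.normal(size=n)

# Two hypothetical "components": disjoint blocks of the weight vector.
blocks = [slice(0, 2), slice(2, 4)]
# Two hypothetical distribution shifts: upweight a subset of the data.
subsets = [np.where(X[:, 0] > 0.5)[0], np.where(X[:, 2] > 0.5)[0]]


def per_sample_loss(w):
    """Loss of every sample at weights w, shape (n,)."""
    return 0.5 * (X @ w - y) ** 2


def local_sgld(num_steps=2000, eps=1e-4, beta=1.0, gamma=100.0, batch=32):
    """SGLD chain started at w_star, with a quadratic pull back toward w_star."""
    w = w_star.copy()
    samples = []
    for _ in range(num_steps):
        idx = rng.choice(n, size=batch, replace=False)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch
        drift = n * beta * grad + gamma * (w - w_star)
        w = w - 0.5 * eps * drift + np.sqrt(eps) * rng.normal(size=d)
        samples.append(w.copy())
    return np.array(samples)


def observable(w, block):
    """Loss with only `block` taken from w; other weights frozen at w_star."""
    w_loc = w_star.copy()
    w_loc[block] = w[block]
    return per_sample_loss(w_loc).mean()


def per_token_susceptibility(samples, block, beta=1.0):
    """Signed per-sample contributions -n*beta*Cov(phi, l_i - L), shape (n,)."""
    phi = np.array([observable(w, block) for w in samples])
    losses = np.array([per_sample_loss(w) for w in samples])   # (T, n)
    delta = losses - losses.mean(axis=1, keepdims=True)        # l_i(w) - L(w)
    phi_c = phi - phi.mean()
    cov = phi_c @ (delta - delta.mean(axis=0)) / len(samples)
    return -n * beta * cov


samples = local_sgld()
per_token = np.stack([per_token_susceptibility(samples, b) for b in blocks])

# Response matrix: rows are components, columns are distribution shifts.
# Each entry is the mean of the per-sample contributions over the subset.
R = np.array([[per_token[j, s].mean() for s in subsets]
              for j in range(len(blocks))])

# Singular values indicate how close the response matrix is to low rank.
singular_values = np.linalg.svd(R, compute_uv=False)
print(R)
print(singular_values)
```

With only two components and two shifts the singular value decomposition is not informative by itself. In this sketch, the same construction with many components and many shifts would give a larger matrix whose leading singular vectors group components that respond alike.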
