Gradient Atoms: Unsupervised Discovery, Attribution and Steering of Model Behaviors via Sparse Decomposition of Training Gradients
This addresses the problem of inefficient and limited supervised attribution methods for model interpretability, offering a scalable, unsupervised alternative for researchers and practitioners.
The paper tackles the mismatch between per-document training data attribution and how models learn broad concepts, by introducing Gradient Atoms, an unsupervised method that decomposes training gradients into sparse components to discover interpretable behaviors like refusal and arithmetic, achieving shifts such as bulleted-list generation from 33% to 94%.
Training data attribution (TDA) methods ask which training documents are responsible for a model behavior. We argue that this per-document framing is fundamentally mismatched to how fine-tuning actually works: models often learn broad concepts shared across many examples. Existing TDA methods are supervised -- they require a query behavior, then score every training document against it -- making them both expensive and unable to surface behaviors the user did not think to ask about. We present Gradient Atoms, an unsupervised method that decomposes per-document training gradients into sparse components ("atoms") via dictionary learning in a preconditioned eigenspace. Among the 500 discovered atoms, the highest-coherence ones recover interpretable task-type behaviors -- refusal, arithmetic, yes/no classification, trivia QA -- without any behavioral labels. These atoms double as effective steering vectors: applying them as weight-space perturbations produces large, controllable shifts in model behavior (e.g., bulleted-list generation 33% to 94%; systematic refusal 50% to 0%). The method requires no query--document scoring stage, and scales independently of the number of query behaviors of interest. Code is here: https://github.com/jrosseruk/gradient_atoms