CL · AI · LG · Jan 29

Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

arXiv:2601.21996v1 · 1 citation · h-index: 1
Originality: Highly original
AI Analysis

The paper provides a scalable framework for understanding and steering the development of interpretable circuits in LLMs, with implications for mechanistic interpretability research.

The paper tackles the problem of tracing interpretable units in large language models back to their origins in the training data. It introduces Mechanistic Data Attribution (MDA), which uses Influence Functions to identify the training samples most responsible for a given unit. Experiments show that targeted interventions on high-influence samples significantly modulate the emergence of interpretable heads, whereas random interventions show no effect. The analysis also finds that repetitive structural data acts as a mechanistic catalyst.

While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention--removing or augmenting a small fraction of high-influence samples--significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis regarding the functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.
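To make the influence-function idea in the abstract concrete, here is a minimal toy sketch in Python. It is not the paper's implementation. It assumes a small ridge-regression model in place of an LLM, and it uses the first parameter as a stand-in for the scalar "unit metric" whose training origins are being traced. The names `influence_scores`, `target_grad`, and `damping` are hypothetical. The score follows the classic formulation: the effect of upweighting sample i on a target quantity f is approximately -∇f(θ)ᵀ H⁻¹ ∇L_i(θ), where H is the Hessian of the training objective.

```python
import numpy as np

def influence_scores(X, y, theta, target_grad, damping=1e-3):
    """Influence of each training sample on a scalar target quantity.

    score_i = -target_grad^T H^{-1} grad_i, where grad_i is the gradient of
    sample i's squared-error loss 0.5 * (x_i . theta - y_i)^2 and H is the
    Hessian of the damped (ridge) training objective at theta.
    """
    n, d = X.shape
    residuals = X @ theta - y                     # shape (n,)
    per_sample_grads = residuals[:, None] * X     # shape (n, d)
    hessian = X.T @ X / n + damping * np.eye(d)   # shape (d, d)
    # Solve H v = target_grad once and reuse v for every sample.
    v = np.linalg.solve(hessian, target_grad)
    return -per_sample_grads @ v

# Toy data: 200 samples, 4 features (all values here are illustrative).
rng = np.random.default_rng(0)
n, d = 200, 4
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=n)

# Fit the ridge objective exactly so theta is a true optimum.
damping = 1e-3
theta = np.linalg.solve(X.T @ X / n + damping * np.eye(d), X.T @ y / n)

# Assumed target: f(theta) = theta[0], so its gradient is the first basis vector.
target_grad = np.zeros(d)
target_grad[0] = 1.0

scores = influence_scores(X, y, theta, target_grad, damping)
top_k = np.argsort(-np.abs(scores))[:5]   # candidate samples for intervention
print(top_k, scores[top_k])
```

In this toy setting the highest-magnitude scores pick out the samples whose removal or duplication would move the target quantity most, which mirrors the targeted-intervention experiments described above. At LLM scale an exact Hessian solve is infeasible, so influence estimation generally relies on approximations of the inverse-Hessian-vector product.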
