Modeling Transformers as complex networks to analyze learning dynamics
This work provides a macroscopic lens for mechanistic interpretability, offering insight into learning dynamics. The contribution is incremental, however, since it applies existing network theory to a new context.
The study addresses how Large Language Models acquire capabilities during training by modeling a Transformer as a complex network. It finds that the network's structure evolves through distinct phases, such as exploration and consolidation, during which stable hierarchies emerge while other components' roles keep changing.
The process by which Large Language Models (LLMs) acquire complex capabilities during training remains a key open question in mechanistic interpretability. This project investigates whether these learning dynamics can be characterized through the lens of Complex Network Theory (CNT). I introduce a novel methodology that represents a Transformer-based LLM as a directed, weighted graph whose nodes are the model's computational components (attention heads and MLPs) and whose edges represent causal influence, measured via an intervention-based ablation technique. By tracking the evolution of this component graph across 143 training checkpoints of the Pythia-14M model on a canonical induction task, I analyze a suite of graph-theoretic metrics. The results reveal that the network's structure evolves through distinct phases of exploration, consolidation, and refinement. Specifically, I identify the emergence of a stable hierarchy of information spreader components and a dynamic set of information gatherer components, whose roles reconfigure at key learning junctures. This work demonstrates that a component-level network perspective offers a powerful macroscopic lens for visualizing and understanding the self-organizing principles that drive the formation of functional circuits in LLMs.
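The component-graph construction can be illustrated with a minimal sketch. It is not the study's code and makes several assumptions: a toy scalar "model" (hypothetical `forward`) stands in for Pythia-14M, each component reads a weighted sum of the input and all earlier components' outputs, and causal influence is taken as the absolute change in a downstream component's output when an upstream one is zero-ablated (this captures total, not only direct, effect). Out-strength as a "spreader" score, in-strength as a "gatherer" score, and top-k Jaccard overlap between consecutive checkpoints as a measure of role stability are likewise illustrative choices, not necessarily the metrics used in the study. The "checkpoints" are synthetic random weight drifts, not real training data.

```python
import math
import random
from typing import Dict, List, Set, Tuple

# Hypothetical toy model: components (stand-ins for heads/MLPs) run in a fixed order;
# each reads the input plus a weighted sum of all earlier components' outputs.
Weights = Dict[Tuple[int, int], float]  # (src, dst) -> weight, used only when src < dst
Graph = Dict[int, Dict[int, float]]     # adjacency: src -> {dst: edge weight}


def forward(n: int, w: Weights, x: float, ablate: int = -1) -> List[float]:
    """Return each component's scalar output; component `ablate` is zero-ablated."""
    outs: List[float] = []
    for i in range(n):
        pre = x + sum(w.get((j, i), 0.0) * outs[j] for j in range(i))
        outs.append(0.0 if i == ablate else math.tanh(pre))
    return outs


def ablation_graph(n: int, w: Weights, x: float, eps: float = 1e-6) -> Graph:
    """Edge src -> dst weighted by |change in dst's output| when src is zero-ablated."""
    clean = forward(n, w, x)
    graph: Graph = {i: {} for i in range(n)}
    for src in range(n):
        ablated = forward(n, w, x, ablate=src)
        for dst in range(src + 1, n):  # only downstream components can be affected
            effect = abs(clean[dst] - ablated[dst])
            if effect > eps:
                graph[src][dst] = effect
    return graph


def strengths(graph: Graph) -> Tuple[Dict[int, float], Dict[int, float]]:
    """Weighted out-strength (assumed spreader score) and in-strength (assumed gatherer score)."""
    out_s = {u: sum(nbrs.values()) for u, nbrs in graph.items()}
    in_s = {u: 0.0 for u in graph}
    for nbrs in graph.values():
        for v, wt in nbrs.items():
            in_s[v] += wt
    return out_s, in_s


def top_k(scores: Dict[int, float], k: int) -> Set[int]:
    """The k highest-scoring components (ties broken by index)."""
    return set(sorted(scores, key=lambda u: (-scores[u], u))[:k])


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Set overlap in [0, 1]; 1.0 means the top-k set did not change."""
    return len(a & b) / len(a | b) if a | b else 1.0


if __name__ == "__main__":
    rng = random.Random(0)
    n, k = 8, 3
    # Synthetic "checkpoints": weights drift randomly; these are NOT real Pythia checkpoints.
    w = {(j, i): rng.gauss(0, 1) for i in range(n) for j in range(i)}
    prev_spreaders = prev_gatherers = None
    for step in range(5):
        out_s, in_s = strengths(ablation_graph(n, w, x=0.5))
        spreaders, gatherers = top_k(out_s, k), top_k(in_s, k)
        if prev_spreaders is not None:
            print(f"step {step}: spreader overlap {jaccard(prev_spreaders, spreaders):.2f}, "
                  f"gatherer overlap {jaccard(prev_gatherers, gatherers):.2f}")
        prev_spreaders, prev_gatherers = spreaders, gatherers
        w = {e: v + rng.gauss(0, 0.3) for e, v in w.items()}
```

In this framing, a consistently high spreader overlap alongside a fluctuating gatherer overlap would be one way to express "stable spreaders, reconfiguring gatherers". In a real model, the scalar `forward` would be replaced by hooks on each head's and MLP's output.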