Transformer Dynamics: A neuroscientific approach to interpretability of large language models
This work addresses the interpretability problem for AI researchers and practitioners by bridging dynamical systems theory and mechanistic interpretability, although its application of neuroscience-inspired methods to AI is incremental.
The paper tackles the challenge of understanding the internal mechanisms of large language models by proposing a framework that treats the residual stream in transformers as a dynamical system evolving across layers. It reports that residual stream activations show strong continuity from layer to layer, accelerate across layers, and exhibit attractor-like dynamics in the lower layers.
As artificial intelligence models have exploded in scale and capability, understanding their internal mechanisms remains a critical challenge. Inspired by the success of dynamical systems approaches in neuroscience, here we propose a novel framework for studying computations in deep learning systems. We focus on the residual stream (RS) in transformer models, conceptualizing it as a dynamical system evolving across layers. We find that activations of individual RS units exhibit strong continuity across layers, despite the RS having a non-privileged basis. Activations in the RS accelerate and grow denser over layers, while individual units trace unstable periodic orbits. In reduced-dimensional spaces, the RS follows a curved trajectory with attractor-like dynamics in the lower layers. These insights bridge dynamical systems theory and mechanistic interpretability, establishing a foundation for a "neuroscience of AI" that combines theoretical rigor with large-scale data analysis to advance our understanding of modern neural networks.
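The three kinds of measurement named in the abstract (layer-to-layer continuity, acceleration, and trajectories in a reduced-dimensional space) can be illustrated with a minimal NumPy sketch. This is not the authors' pipeline. The function names `layer_continuity`, `layer_velocity`, and `pca_trajectory` are hypothetical, and so are the choice of Pearson correlation as the continuity measure, the use of PCA via SVD for dimensionality reduction, and the synthetic random-walk data that stands in for real residual stream activations.

```python
import numpy as np


def layer_continuity(rs):
    """Pearson correlation between the activation vectors of consecutive layers.

    rs has shape (n_layers, n_units); the result has shape (n_layers - 1,).
    """
    return np.array([np.corrcoef(a, b)[0, 1] for a, b in zip(rs[:-1], rs[1:])])


def layer_velocity(rs):
    """Euclidean norm of the change in activations between consecutive layers."""
    return np.linalg.norm(np.diff(rs, axis=0), axis=1)


def pca_trajectory(rs, k=2):
    """Project the layer-by-layer trajectory onto its top-k principal components."""
    centered = rs - rs.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T


# Synthetic stand-in (an assumption, not real model data): a random walk
# across layers whose step size grows with depth.
rng = np.random.default_rng(0)
n_layers, n_units = 12, 64
step_scale = np.linspace(0.1, 1.0, n_layers)[:, None]
rs = np.cumsum(rng.normal(size=(n_layers, n_units)) * step_scale, axis=0)

continuity = layer_continuity(rs)    # one correlation per layer transition
velocity = layer_velocity(rs)        # one speed per layer transition
acceleration = np.diff(velocity)     # change in speed between transitions
trajectory = pca_trajectory(rs, k=2) # (n_layers, 2) path for plotting
```

With a real model, `rs` would be replaced by the residual stream activations recorded at each layer for a given token. How those activations are extracted depends on the model and library in use and is left out of this sketch.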