Tiberiu Musat

LG
h-index40
5papers
15citations
Novelty65%
AI Score52

5 Papers

AINov 2, 2025
On the Emergence of Induction Heads for In-Context Learning

Tiberiu Musat, Tiago Pimentel, Lorenzo Noci et al.

Transformers have become the dominant architecture for natural language processing. Part of their success is owed to a remarkable capability known as in-context learning (ICL): they can acquire and apply novel associations solely from their input context, without any updates to their weights. In this work, we study the emergence of induction heads, a previously identified mechanism in two-layer transformers that is particularly important for in-context learning. We uncover a relatively simple and interpretable structure of the weight matrices implementing the induction head. We theoretically explain the origin of this structure using a minimal ICL task formulation and a modified transformer architecture. We give a formal proof that the training dynamics remain constrained to a 19-dimensional subspace of the parameter space. Empirically, we validate this constraint while observing that only 3 dimensions account for the emergence of an induction head. By further studying the training dynamics inside this 3-dimensional subspace, we find that the time until the emergence of an induction head follows a tight asymptotic bound that is quadratic in the input context length.

LGMay 11
Neural Weight Norm = Kolmogorov Complexity

Tiberiu Musat

Why does weight decay work? We prove that, in any fixed-precision regime, the smallest weight norm of a looped neural network outputting a binary string equals the Kolmogorov complexity of that string, up to a logarithmic factor. This implies that weight decay induces a prior matching Solomonoff's universal prior, the optimal prior over computable functions, up to a polynomial factor. The result is norm-agnostic: in fixed precision, every weight norm collapses to the non-zero parameter count up to constants, so the same sandwich bound holds for any norm used as a regulariser. The proof has two short reductions: any program for a universal Turing machine can be encoded into neural weights at unit cost per program bit, and any fixed-precision network can be described by enumerating its non-zero parameters with logarithmic addressing overhead. Both bounds are tight up to constants, with the logarithmic factor realised by permutation encodings: a network whose parameters encode a permutation produces a string whose Kolmogorov complexity is the non-zero parameter count times its logarithm. The fixed-precision assumption is essential: with infinite precision, neural networks can encode non-computable functions and the weight norm loses its relevance.

LGAug 18, 2024
Clustering and Alignment: Understanding the Training Dynamics in Modular Addition

Tiberiu Musat

Recent studies have revealed that neural networks learn interpretable algorithms for many simple problems. However, little is known about how these algorithms emerge during training. In this article, I study the training dynamics of a small neural network with 2-dimensional embeddings on the problem of modular addition. I observe that embedding vectors tend to organize into two types of structures: grids and circles. I study these structures and explain their emergence as a result of two simple tendencies exhibited by pairs of embeddings: clustering and alignment. I propose explicit formulae for these tendencies as interaction forces between different pairs of embeddings. To show that my formulae can fully account for the emergence of these structures, I construct an equivalent particle simulation where I show that identical structures emerge. I discuss the role of weight decay in my setup and reveal a new mechanism that links regularization and training dynamics. To support my findings, I also release an interactive demo available at https://modular-addition.vercel.app/.

LGNov 2, 2025
The Geometry of Grokking: Norm Minimization on the Zero-Loss Manifold

Tiberiu Musat

Grokking is a puzzling phenomenon in neural networks where full generalization occurs only after a substantial delay following the complete memorization of the training data. Previous research has linked this delayed generalization to representation learning driven by weight decay, but the precise underlying dynamics remain elusive. In this paper, we argue that post-memorization learning can be understood through the lens of constrained optimization: gradient descent effectively minimizes the weight norm on the zero-loss manifold. We formally prove this in the limit of infinitesimally small learning rates and weight decay coefficients. To further dissect this regime, we introduce an approximation that decouples the learning dynamics of a subset of parameters from the rest of the network. Applying this framework, we derive a closed-form expression for the post-memorization dynamics of the first layer in a two-layer network. Experiments confirm that simulating the training process using our predicted gradients reproduces both the delayed generalization and representation learning characteristic of grokking.

LGNov 18, 2024
Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers

Tiberiu Musat

In this paper, I introduce the retrieval problem, a simple yet common reasoning task that can be solved only by transformers with a minimum number of layers, which grows logarithmically with the input size. I empirically show that large language models can solve the task under different prompting formulations without any fine-tuning. To understand how transformers solve the retrieval problem, I train several transformers on a minimal formulation. Successful learning occurs only under the presence of an implicit curriculum. I uncover the learned mechanisms by studying the attention maps in the trained transformers. I also study the training process, uncovering that attention heads always emerge in a specific sequence guided by the implicit curriculum.