Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers
The work offers insight into how transformers learn reasoning mechanisms, which matters for understanding and improving AI models, though the contribution is incremental because it builds on existing transformer analysis.
The paper tackles the retrieval problem, a reasoning task that transformers can solve only with a minimum number of layers that grows logarithmically with the input size. It shows that large language models can solve the task without fine-tuning, and that during training attention heads emerge in a specific sequence guided by an implicit curriculum.
In this paper, I introduce the retrieval problem, a simple yet common reasoning task that can be solved only by transformers with a minimum number of layers, which grows logarithmically with the input size. I empirically show that large language models can solve the task under different prompting formulations without any fine-tuning. To understand how transformers solve the retrieval problem, I train several transformers on a minimal formulation. Successful learning occurs only in the presence of an implicit curriculum. I uncover the learned mechanisms by studying the attention maps in the trained transformers. I also study the training process and find that attention heads always emerge in a specific sequence guided by the implicit curriculum.