LG DIS-NN IT MLJun 2, 2025

Bayes optimal learning of attention-indexed models

Fabrizio Boncoraglio, Emanuele Troiani, Vittorio Erba, Lenka Zdeborová

arXiv:2506.01582v213.04 citationsh-index: 40Has Code

Originality Highly original

AI Analysis

This provides a solvable theoretical framework for understanding learning in self-attention layers, which are key components of modern transformer architectures, though it appears incremental as it builds on prior tractable attention models.

The paper tackles the problem of theoretically analyzing learning in deep attention layers by introducing the attention-indexed model (AIM), which captures token-level outputs from layered bilinear interactions and aligns closely with practical transformers. The result includes deriving closed-form predictions for Bayes-optimal generalization error, identifying sharp phase transitions based on sample complexity, model width, and sequence length, and showing that gradient descent can achieve optimal performance.

We introduce the attention-indexed model (AIM), a theoretical framework for analyzing learning in deep attention layers. Inspired by multi-index models, AIM captures how token-level outputs emerge from layered bilinear interactions over high-dimensional embeddings. Unlike prior tractable attention models, AIM allows full-width key and query matrices, aligning more closely with practical transformers. Using tools from statistical mechanics and random matrix theory, we derive closed-form predictions for Bayes-optimal generalization error and identify sharp phase transitions as a function of sample complexity, model width, and sequence length. We propose a matching approximate message passing algorithm and show that gradient descent can reach optimal performance. AIM offers a solvable playground for understanding learning in self-attention layers, that are key components of modern architectures.

View on arXiv PDF Code

Similar