Statistical Advantage of Softmax Attention: Insights from Single-Location Regression
This work offers machine learning researchers theoretical insight into attention mechanisms, though its contribution is incremental in that it builds on existing statistical physics frameworks.
The paper tackles the question of why softmax attention outperforms alternatives in large language models. It analyzes a single-location regression task and shows that softmax achieves the Bayes risk at the population level and that, in the finite-sample regime, it consistently outperforms linear attention even though it is no longer Bayes-optimal.
Large language models rely on attention mechanisms with a softmax activation. Yet the dominance of softmax over alternatives (e.g., component-wise or linear) remains poorly understood, and many theoretical works have focused on the easier-to-analyze linearized attention. In this work, we address this gap through a principled study of the single-location regression task, where the output depends on a linear transformation of a single input token at a random location. Building on ideas from statistical physics, we develop an analysis of attention-based predictors in the high-dimensional limit, where generalization performance is captured by a small set of order parameters. At the population level, we show that softmax achieves the Bayes risk, whereas linear attention fundamentally falls short. We then examine other activation functions to identify which properties are necessary for optimal performance. Finally, we analyze the finite-sample regime: we provide an asymptotic characterization of the test error and show that, while softmax is no longer Bayes-optimal, it consistently outperforms linear attention. We discuss the connection with optimization by gradient-based algorithms.
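To make the setting concrete, the following is a minimal numerical sketch of a single-location regression task and of the two predictors being compared. It is an illustration under stated assumptions, not the paper's exact model or analysis: the Gaussian tokens, the hidden directions `k` and `v`, the `shift` that marks the relevant token, the inverse temperature `beta`, and the scalar rescaling of the linear predictor are all hypothetical choices made here. Both predictors are given the true directions, so the sketch only probes the effect of the activation at the oracle level and involves no training.

```python
import numpy as np

def make_data(n, L, d, k, v, shift, rng):
    """Toy single-location regression (assumed form, not the paper's exact model).

    Each sample has L Gaussian tokens in dimension d. One token, at a uniformly
    random location j0, is shifted along the hidden direction k, and the label
    is a linear function <v, x_{j0}> of that token only.
    """
    X = rng.standard_normal((n, L, d))
    j0 = rng.integers(0, L, size=n)
    rows = np.arange(n)
    X[rows, j0] += shift * k          # mark the relevant token
    y = X[rows, j0] @ v               # label depends on that token alone
    return X, y

def softmax_attention(X, k, v, beta):
    """Softmax-weighted average of the per-token values <v, x_j>."""
    scores = beta * (X @ k)                       # shape (n, L)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    return np.sum(w * (X @ v), axis=1)

def linear_attention(X, k, v):
    """Unnormalized linear attention: weights are the raw scores <k, x_j>."""
    return np.sum((X @ k) * (X @ v), axis=1)

def compare(seed=0, n=20000, L=5, d=50, shift=4.0):
    rng = np.random.default_rng(seed)
    # Two orthonormal hidden directions (assumption made for simplicity).
    Q, _ = np.linalg.qr(rng.standard_normal((d, 2)))
    k, v = Q[:, 0], Q[:, 1]
    X, y = make_data(n, L, d, k, v, shift, rng)

    # With these Gaussian assumptions, beta = shift makes the softmax weights
    # equal to the posterior over the hidden location given the scores.
    pred_soft = softmax_attention(X, k, v, beta=shift)

    # Give linear attention its best scalar rescaling (least squares).
    raw = linear_attention(X, k, v)
    pred_lin = raw * (raw @ y) / (raw @ raw)

    mse_soft = float(np.mean((pred_soft - y) ** 2))
    mse_lin = float(np.mean((pred_lin - y) ** 2))
    return mse_soft, mse_lin

if __name__ == "__main__":
    mse_soft, mse_lin = compare()
    print(f"softmax attention MSE: {mse_soft:.4f}")
    print(f"linear attention MSE:  {mse_lin:.4f}")
```

In this toy setup the softmax weights can concentrate on the marked token, whereas the linear weights keep mixing in the irrelevant tokens at any rescaling, which is one intuition for the gap described above. The finite-sample regime discussed in the abstract, where the directions must be learned from data, is not covered by this sketch.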