Dimension-Free Minimax Rates for Learning Pairwise Interactions in Attention-Style Models
The work provides a theoretical foundation for the statistical efficiency of attention mechanisms in machine learning, showing that dimension-free rates hold even under non-identifiability. The contribution is incremental, but it clarifies fundamental properties.
The paper studies the problem of learning pairwise interactions in single-layer attention-style models and proves a minimax convergence rate of $M^{-2β/(2β+1)}$, where $M$ is the sample size. The rate depends only on the smoothness $β$ of the activation and is independent of the token count, the ambient dimension, and the rank of the weight matrix.
We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a non-linear activation function. We prove that the minimax rate is $M^{-\frac{2β}{2β+1}}$, where $M$ is the sample size. The rate depends only on the smoothness $β$ of the activation and, crucially, is independent of the token count, the ambient dimension, and the rank of the weight matrix. These results highlight a fundamental dimension-free statistical efficiency of attention-style nonlocal models, even when the weight matrix and the activation are not separately identifiable, and they provide a theoretical understanding of the attention mechanism and its training.
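To make the stated rate concrete, the following minimal sketch evaluates the scaling $M^{-\frac{2β}{2β+1}}$ for a few sample sizes and smoothness levels. It is an illustration only: the function names `rate_exponent` and `rate_scaling` are hypothetical, the chosen values of $M$ and $β$ are arbitrary, and the multiplicative constant in the rate is assumed to be 1, so the numbers show the scaling rather than actual error values.

```python
def rate_exponent(beta: float) -> float:
    """Return the exponent 2*beta / (2*beta + 1) of the rate M^{-2β/(2β+1)}."""
    if beta <= 0:
        raise ValueError("smoothness beta must be positive")
    return 2.0 * beta / (2.0 * beta + 1.0)


def rate_scaling(num_samples: int, beta: float) -> float:
    """Return M^{-2β/(2β+1)} with the constant factor assumed to be 1."""
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    return float(num_samples) ** (-rate_exponent(beta))


if __name__ == "__main__":
    # Arbitrary illustrative values; none of them come from the paper.
    sample_sizes = (10**3, 10**4, 10**5)
    for beta in (0.5, 1.0, 2.0, 4.0):
        cells = ", ".join(
            f"M={m}: {rate_scaling(m, beta):.2e}" for m in sample_sizes
        )
        print(f"beta={beta}: exponent={rate_exponent(beta):.3f}; {cells}")
```

The exponent approaches 1 as $β$ grows, so smoother activations bring the scaling closer to $M^{-1}$. Neither function takes the token count, the ambient dimension, or the rank as an argument, which mirrors the dimension-free statement of the result.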