Learning Theory of Transformers: Local-to-Global Approximation via Softmax Partition of Unity
For machine learning theorists, this work establishes rigorous approximation and generalization guarantees for Transformers, bridging the gap between practice and theory.
This paper provides a theoretical analysis of Transformer networks for regression, proving that a shallow, wide Transformer with two encoder blocks can uniformly approximate α-Hölder continuous functions with ε error using O(ε^{-d/α}) parameters, and achieves near minimax-optimal generalization error of O(n^{-2α/(2α+d)} log n).
This paper investigates the learning theory of Transformer networks for regression tasks on the compact Euclidean domain $[0,1]^d$ and $d$-dimensional compact Riemannian manifolds. We propose a novel constructive approximation framework for Transformers that builds local approximations of the target function and aggregates them into a global approximation via softmax partition of unity. This approach leverages the attention mechanism to achieve spatial localization through affine transformations of the input. The softmax activation plays a crucial role in aggregating local approximations to a global output. From an approximation perspective, we prove that a dense Transformer equipped with only two encoder blocks and standard single-hidden-layer point-wise feed-forward networks can achieve a uniform $\varepsilon$-approximation error for $α$-Hölder continuous functions with $α\in (0,1]$ using $\mathcal{O}(\varepsilon^{-d/α})$ total parameters. Building upon this approximation guarantee, we establish a near minimax-optimal generalization error bound of order $\mathcal{O}\big(n^{-\frac{2α}{2α+d}} \log n\big)$ for the empirical risk minimizer, where $n$ is the training data size. The Transformer architecture studied in this paper is dense, shallow and wide, and employs softmax activation and sinusoidal positional encodings, closely reflecting practical implementations.