ML LG Apr 16, 2025

Approximation Bounds for Transformer Networks with Application to Regression

arXiv:2504.12175v1, 5 citations, h-index: 7
Originality: Incremental advance
AI Analysis

This paper provides theoretical guarantees for Transformers in function approximation and regression, filling a gap for machine learning practitioners, though the advance is incremental relative to existing neural network theory.

The paper establishes approximation bounds for Transformer networks on Hölder and Sobolev functions, showing that to achieve error ε, a fixed-depth Transformer requires parameters scaling as ε^(-d_x n/γ), matching known bounds for FNNs and RNNs. It also derives convergence rates for nonparametric regression with dependent data, imposing no weight constraints.
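To get a feel for the scaling ε^(-d_x n/γ), the following minimal sketch evaluates the rate for a few settings. It assumes that d_x denotes the per-token input dimension and n the sequence length (the summary does not define them), and it ignores all constants and logarithmic factors, so the numbers are orders of magnitude rather than actual parameter counts. The function name `param_scale` is hypothetical.

```python
def param_scale(eps, d_x, n, gamma):
    """Return eps ** (-(d_x * n) / gamma), the stated parameter scaling.

    Assumptions (not from the paper): d_x is the token dimension, n is the
    sequence length, and all constants and log factors are dropped.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    if not 0.0 < gamma <= 1.0:
        raise ValueError("the Hölder index gamma must lie in (0, 1]")
    return eps ** (-(d_x * n) / gamma)


if __name__ == "__main__":
    # Halving the smoothness index gamma squares the required scale.
    for gamma in (1.0, 0.5):
        for eps in (0.5, 0.1):
            scale = param_scale(eps, d_x=2, n=2, gamma=gamma)
            print(f"gamma={gamma}, eps={eps}: scale ~ {scale:.3g}")
```

The exponent grows with the product d_x n, so the bound exhibits the usual curse of dimensionality in the total input size, and lower smoothness (smaller γ) makes it worse.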

We explore the approximation capabilities of Transformer networks for Hölder and Sobolev functions, and apply these results to address nonparametric regression estimation with dependent observations. First, we establish novel upper bounds for standard Transformer networks approximating sequence-to-sequence mappings whose component functions are Hölder continuous with smoothness index $γ \in (0,1]$. To achieve an approximation error $\varepsilon$ under the $L^p$-norm for $p \in [1, \infty]$, it suffices to use a fixed-depth Transformer network whose total number of parameters scales as $\varepsilon^{-d_x n / γ}$. This result not only extends existing findings to include the case $p = \infty$, but also matches the best known upper bounds on the number of parameters previously obtained for fixed-depth FNNs and RNNs. Similar bounds are also derived for Sobolev functions. Second, we derive explicit convergence rates for the nonparametric regression problem under various $β$-mixing data assumptions, which allow the dependence between observations to weaken over time. Our bounds on the sample complexity impose no constraints on weight magnitudes. Lastly, we propose a novel proof strategy to establish approximation bounds, inspired by the Kolmogorov-Arnold representation theorem. We show that if the self-attention layer in a Transformer can perform column averaging, the network can approximate sequence-to-sequence Hölder functions, offering new insights into the interpretability of self-attention mechanisms.
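The column-averaging idea mentioned at the end of the abstract can be illustrated with a small sketch. It is not the paper's construction, only one simple way a single softmax self-attention head can average tokens: if the query and key projections are zero, every attention score is zero, the softmax weights are uniform, and each output token is the mean of the value vectors. Tokens are stored here as rows rather than columns, and all function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax of a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def self_attention(x, w_q, w_k, w_v):
    """Single-head softmax attention; x has one token per row."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(w_q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


def identity(d):
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


# Three tokens of dimension two (illustrative data).
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
d = 2

# Zero query/key projections give uniform attention weights, and an
# identity value projection passes the tokens through unchanged.
averaged = self_attention(x, zeros(d, d), zeros(d, d), identity(d))

if __name__ == "__main__":
    for row in averaged:
        print(row)  # every row is (up to rounding) the token mean
```

Every output row equals the mean of the input tokens up to floating-point rounding, which is the averaging behaviour the abstract assumes the self-attention layer can realise.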

