Approximation Error Upper and Lower Bounds for Hölder Class with Transformers
Provides the first rigorous lower bound for Transformer approximation, offering theoretical insights into the minimal network size needed for function approximation, which is important for understanding Transformer efficiency.
The paper establishes precise approximation error upper and lower bounds for Hölder class functions using Transformers, proving that a Transformer network requires at most O(ε^{-d0/α}) blocks and at least Ω(ε^{-d0/(4α)}) blocks to achieve ε accuracy.
We explore the expressive power of Transformers by establishing precise approximation error upper and lower bounds for Hölder class. Specifically, a new approximation upper bound is derived for the standard Transformer architecture equipped with Softmax operators, ReLU activation functions, and residual connections. We prove that a Transformer network composed of at most $\mathcal{O}(\varepsilon^{-{d_{0}}/α})$ blocks can approximate any bounded Hölder function with $d_{0}$-dimensional input and smoothness $α\in(0,1]$ under any accuracy $\varepsilon>0$. In the case of approximation lower bounds, leveraging the VC-dimension upper bound, we are the first to rigorously prove that Transformers demand for at least $Ω(\varepsilon^{-{d_{0}}/({4α})})$ blocks to achieve the $\varepsilon$ approximation accuracy. As a final step, we extend the derived results for standard Transformers to a general regression task and establish the corresponding excess risk rates demonstrating Transformers' empirical effectiveness in real-world settings.