Optimal Approximation Rate of ReLU Networks in terms of Width and Depth
This work provides foundational theoretical insights into neural network design, addressing a core problem in machine learning for researchers and practitioners by establishing optimal bounds on approximation capabilities.
This paper tackles the problem of approximating continuous functions using ReLU neural networks by deriving optimal approximation rates in terms of width and depth, achieving rates such as O(λ√d (N^2L^2 ln N)^{-α/d}) for Hölder continuous functions, which are shown to be optimal up to a constant factor.
This paper concentrates on the approximation power of deep feed-forward neural networks in terms of width and depth. It is proved by construction that ReLU networks with width $\mathcal{O}\big(\max\{d\lfloor N^{1/d}\rfloor,\, N+2\}\big)$ and depth $\mathcal{O}(L)$ can approximate a Hölder continuous function on $[0,1]^d$ with an approximation rate $\mathcal{O}\big(λ\sqrt{d} (N^2L^2\ln N)^{-α/d}\big)$, where $α\in (0,1]$ and $λ>0$ are Hölder order and constant, respectively. Such a rate is optimal up to a constant in terms of width and depth separately, while existing results are only nearly optimal without the logarithmic factor in the approximation rate. More generally, for an arbitrary continuous function $f$ on $[0,1]^d$, the approximation rate becomes $\mathcal{O}\big(\,\sqrt{d}\,ω_f\big( (N^2L^2\ln N)^{-1/d}\big)\,\big)$, where $ω_f(\cdot)$ is the modulus of continuity. We also extend our analysis to any continuous function $f$ on a bounded set. Particularly, if ReLU networks with depth $31$ and width $\mathcal{O}(N)$ are used to approximate one-dimensional Lipschitz continuous functions on $[0,1]$ with a Lipschitz constant $λ>0$, the approximation rate in terms of the total number of parameters, $W=\mathcal{O}(N^2)$, becomes $\mathcal{O}(\tfracλ{W\ln W})$, which has not been discovered in the literature for fixed-depth ReLU networks.