Optimal Neural Network Approximation for High-Dimensional Continuous Functions
This work addresses the challenge of parameter efficiency in neural network approximation for high-dimensional continuous functions, which is incremental as it builds on prior methods to improve bounds.
The paper tackles the problem of reducing the number of neurons or parameters in neural networks for approximating high-dimensional continuous functions, achieving a network with at most 10889d + 10887 nonzero parameters that attains super approximation, and shows that at least d parameters are required, indicating linear growth is optimal.
Recently, the authors of \cite{SYZ22} developed a neural network with width $36d(2d + 1)$ and depth $11$, which utilizes a special activation function called the elementary universal activation function, to achieve the super approximation property for functions in $C([a,b]^d)$. That is, the constructed network only requires a fixed number of neurons (and thus parameters) to approximate a $d$-variate continuous function on a $d$-dimensional hypercube with arbitrary accuracy. More specifically, only $\mathcal{O}(d^2)$ neurons or parameters are used. One natural question is whether we can reduce the number of these neurons or parameters in such a network. By leveraging a variant of the Kolmogorov Superposition Theorem, \textcolor{black}{we show that there is a composition of networks generated by the elementary universal activation function with at most $10889d + 10887$ nonzero parameters such that this super approximation property is attained. The composed network consists of repeated evaluations of two neural networks: one with width $36(2d+1)$ and the other with width 36, both having 5 layers.} Furthermore, we present a family of continuous functions that requires at least width $d$, and thus at least $d$ neurons or parameters, to achieve arbitrary accuracy in its approximation. This suggests that the number of nonzero parameters is optimal in the sense that it grows linearly with the input dimension $d$, unlike some approximation methods where parameters may grow exponentially with $d$.