Smoothness Adaptivity in Constant-Depth Neural Networks: Optimal Rates via Smooth Activations
This work addresses a foundational question in deep learning theory by identifying activation smoothness as a key mechanism for statistical optimality, a result with potential relevance across machine learning.
The paper examines the theoretical advantages of smooth activation functions in neural networks. It proves that constant-depth networks with smooth activations achieve minimax-optimal approximation and estimation error rates (up to logarithmic factors) for arbitrary smoothness, whereas networks with non-smooth activations such as ReLU need depth that grows with the smoothness order they are to capture.
Smooth activation functions are ubiquitous in modern deep learning, yet their theoretical advantages over non-smooth counterparts remain poorly understood. In this work, we characterize both approximation and statistical properties of neural networks with smooth activations over the Sobolev space $W^{s,\infty}([0,1]^d)$ for arbitrary smoothness $s>0$. We prove that constant-depth networks equipped with smooth activations automatically exploit arbitrarily high orders of target function smoothness, achieving the minimax-optimal approximation and estimation error rates (up to logarithmic factors). In sharp contrast, networks with non-smooth activations, such as ReLU, lack this adaptivity: their attainable approximation order is strictly limited by depth, and capturing higher-order smoothness requires proportional depth growth. These results identify activation smoothness as a fundamental mechanism, alternative to depth, for attaining statistical optimality. Technically, our results are established via a constructive approximation framework that produces explicit neural network approximators with carefully controlled parameter norms and model size. This complexity control ensures statistical learnability under empirical risk minimization (ERM) and removes the impractical sparsity constraints commonly required in prior analyses.
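For orientation, the rates that "minimax-optimal" conventionally refers to for $s$-smooth targets in dimension $d$ can be written out. This is a sketch of the classical benchmarks, not the paper's exact theorem statements: the symbols $N$ (number of network parameters), $n$ (sample size), $f^\star$ (target function), $\hat f_n$ (ERM estimator), and the choice of norms are assumptions introduced here for illustration.

```latex
% Classical benchmark rates for an s-smooth target on [0,1]^d.
% Assumed notation: N = parameter count, n = sample size,
% f^\star = target in W^{s,\infty}([0,1]^d), \hat f_n = ERM estimator.
% Both bounds are understood up to logarithmic factors.
\[
  \inf_{f_N} \bigl\| f^\star - f_N \bigr\|_{L^\infty([0,1]^d)} \;\lesssim\; N^{-s/d},
  \qquad
  \mathbb{E}\, \bigl\| \hat f_n - f^\star \bigr\|_{L^2}^{2} \;\lesssim\; n^{-\frac{2s}{2s+d}}.
\]
```

As a worked instance under these assumptions, take $d = 4$: raising $s$ from $2$ to $6$ moves the estimation exponent $2s/(2s+d)$ from $4/8 = 1/2$ to $12/16 = 3/4$. The abstract's claim is that a constant-depth network with a smooth activation tracks this improvement automatically, whereas a ReLU network would need additional depth to do so.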