ML LG | Nov 24, 2023

Analysis of the expected $L_2$ error of an over-parametrized deep neural network estimate learned by gradient descent without regularization

arXiv:2311.14609v1 | 5.94 citations | h-index: 42

Originality: Incremental advance

AI Analysis

The paper contributes to the theoretical understanding of deep learning optimization by showing that a regularization term is not essential for consistency. The advance is incremental, but it clarifies which assumptions the theory of neural network estimates actually needs.

The paper demonstrates that over-parametrized deep neural networks trained by gradient descent without regularization can be universally consistent and can achieve convergence rates comparable to those of regularized methods: approximately $n^{-1/(1+d)}$ for Hölder smooth regression functions with exponent $1/2 \leq p \leq 1$, and a rate that does not depend on the input dimension $d$ for interaction models.
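To give a feel for what a rate of approximately $n^{-1/(1+d)}$ means in practice, the following sketch evaluates the leading term for a few input dimensions. It is only an illustration: constants and any logarithmic factors hidden behind "approximately" are ignored, and the helper names `rate` and `samples_needed` are hypothetical, not taken from the paper.

```python
def rate(n: int, d: int) -> float:
    """Leading term n^(-1/(1+d)) of the rate (constants and log factors ignored)."""
    return n ** (-1.0 / (1.0 + d))


def samples_needed(eps: float, d: int) -> float:
    """Sample size n at which the leading term n^(-1/(1+d)) equals eps."""
    return eps ** (-(1.0 + d))


if __name__ == "__main__":
    n = 10**6
    for d in (1, 2, 5, 10):
        # The leading term shrinks more slowly as the input dimension d grows.
        print(f"d={d:2d}  rate(n=1e6)={rate(n, d):.4f}  "
              f"n for eps=0.1: {samples_needed(0.1, d):.3g}")
```

The leading term deteriorates quickly as $d$ grows, which is why a rate that does not depend on $d$, as derived for interaction models, is of interest.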

Recent results show that estimates defined by over-parametrized deep neural networks learned by applying gradient descent to a regularized empirical $L_2$ risk are universally consistent and achieve good rates of convergence. In this paper, we show that the regularization term is not necessary to obtain similar results. In the case of a suitably chosen initialization of the network, a suitable number of gradient descent steps, and a suitable step size we show that an estimate without a regularization term is universally consistent for bounded predictor variables. Additionally, we show that if the regression function is Hölder smooth with Hölder exponent $1/2 \leq p \leq 1$, the $L_2$ error converges to zero with a convergence rate of approximately $n^{-1/(1+d)}$. Furthermore, in case of an interaction model, where the regression function consists of a sum of Hölder smooth functions with $d^*$ components, a rate of convergence is derived which does not depend on the input dimension $d$.
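The setting described in the abstract can be sketched as plain gradient descent on the empirical $L_2$ risk with no penalty term. The code below is a minimal illustration under stated assumptions, not the paper's construction: the abstract does not specify the network topology, initialization, step size, or number of steps, so the one-hidden-layer sigmoid network, the zero-initialized outer weights, the step size `0.5 / width`, and all function names here are hypothetical choices. The toy regression function is a sum of terms $\sqrt{|x_j - 0.5|}$, which is Hölder smooth with exponent $1/2$, and the predictors are bounded in $[0,1]^d$.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def empirical_l2_risk(params, X, y):
    """Unregularized empirical L2 risk: mean squared residual, no penalty term."""
    hidden = sigmoid(X @ params["W1"] + params["b1"])
    return float(np.mean((hidden @ params["w2"] - y) ** 2))


def gradient_descent(X, y, width=200, steps=500, seed=0):
    """Full-batch gradient descent on the empirical L2 risk (assumed setup)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    params = {
        "W1": rng.normal(0.0, 1.0, size=(d, width)),  # assumed random inner weights
        "b1": rng.normal(0.0, 1.0, size=width),
        "w2": np.zeros(width),                        # assumed zero outer weights
    }
    step_size = 0.5 / width  # assumed; small enough for stable steps here
    history = [empirical_l2_risk(params, X, y)]
    for _ in range(steps):
        hidden = sigmoid(X @ params["W1"] + params["b1"])      # shape (n, width)
        resid = 2.0 * (hidden @ params["w2"] - y) / n          # d risk / d prediction
        d_pre = np.outer(resid, params["w2"]) * hidden * (1.0 - hidden)
        grads = {
            "W1": X.T @ d_pre,
            "b1": d_pre.sum(axis=0),
            "w2": hidden.T @ resid,
        }
        for name in params:
            params[name] = params[name] - step_size * grads[name]
        history.append(empirical_l2_risk(params, X, y))
    return params, history


def make_data(n=200, d=2, noise=0.1, seed=1):
    """Bounded predictors in [0,1]^d and a Hölder-(1/2) regression function."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, d))
    y = np.sqrt(np.abs(X - 0.5)).sum(axis=1) + noise * rng.normal(size=n)
    return X, y


if __name__ == "__main__":
    X, y = make_data()
    _, history = gradient_descent(X, y)
    print(f"empirical L2 risk: start={history[0]:.4f}, end={history[-1]:.4f}")
```

Nothing in the update rule shrinks the weights toward zero. In this sketch, the only controls on the fit are the initialization, the step size, and the number of steps, which mirrors the three "suitable" choices the abstract lists.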
