ML LG Dec 6, 2020

Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods

arXiv:2012.03224v114.715 citations

Originality Highly original

AI Analysis

This work provides a theoretical explanation for why deep learning can surpass shallow learning methods, addressing a fundamental question for deep learning researchers.

This paper theoretically analyzes the excess risk of a deep learning estimator trained with noisy gradient descent on a mildly overparameterized neural network. It demonstrates that deep learning can achieve a learning rate faster than $O(1/\sqrt{n})$ and provably outperform various linear estimators, including kernel methods, especially in high-dimensional settings.

Establishing a theoretical analysis that explains why deep learning can outperform shallow learning methods such as kernel methods is one of the biggest issues in the deep learning literature. Toward answering this question, we evaluate the excess risk of a deep learning estimator trained by noisy gradient descent with ridge regularization on a mildly overparameterized neural network, and discuss its superiority to a class of linear estimators that includes the neural tangent kernel approach, the random feature model, other kernel methods, the $k$-NN estimator, and so on. We consider a teacher-student regression model, and eventually show that any linear estimator can be outperformed by deep learning in the sense of the minimax optimal rate, especially in a high-dimensional setting. The obtained excess risk bounds are so-called fast learning rates, which are faster than the $O(1/\sqrt{n})$ rate obtained by the usual Rademacher complexity analysis. This discrepancy is induced by the non-convex geometry of the model, and the noisy gradient descent used for neural network training provably reaches a near-globally optimal solution even though the loss landscape is highly non-convex. Although the noisy gradient descent does not employ any explicit or implicit sparsity-inducing regularization, it shows a preferable generalization performance that dominates linear estimators.
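To make the training procedure named in the abstract concrete, the following is a minimal sketch of noisy gradient descent with ridge regularization on a two-layer student network in a teacher-student regression setup. The abstract does not give the update rule, so this assumes the common Langevin-type form: a gradient step on the ridge-regularized squared loss plus Gaussian noise scaled by sqrt(2 * eta / beta). The tanh activation, the widths, and all names (`make_teacher_data`, `init_student`, `ridge_loss_and_grads`, `noisy_gd`, `eta`, `beta`, `lam`) are illustrative assumptions, not the paper's construction.

```python
import numpy as np


def make_teacher_data(n, d, teacher_width, noise_std, rng):
    """Sample inputs and noisy labels from a small (hypothetical) teacher network."""
    X = rng.normal(size=(n, d))
    W_t = rng.normal(size=(teacher_width, d)) / np.sqrt(d)
    a_t = rng.normal(size=teacher_width)
    y = np.tanh(X @ W_t.T) @ a_t + noise_std * rng.normal(size=n)
    return X, y


def init_student(d, width, rng):
    """Student network f(x) = sum_j a_j * tanh(w_j . x); width exceeds the teacher's."""
    return {
        "W": rng.normal(size=(width, d)) / np.sqrt(d),
        "a": rng.normal(size=width) / np.sqrt(width),
    }


def ridge_loss_and_grads(params, X, y, lam):
    """Half mean squared error plus (lam / 2) * squared parameter norm, with gradients."""
    W, a = params["W"], params["a"]
    n = X.shape[0]
    H = np.tanh(X @ W.T)                      # hidden activations, shape (n, width)
    resid = H @ a - y                         # residuals, shape (n,)
    loss = 0.5 * np.mean(resid ** 2) + 0.5 * lam * (np.sum(W ** 2) + np.sum(a ** 2))
    grad_a = H.T @ resid / n + lam * a
    G = resid[:, None] * (1.0 - H ** 2) * a[None, :]   # backprop through tanh
    grad_W = G.T @ X / n + lam * W
    return loss, {"W": grad_W, "a": grad_a}


def noisy_gd(params, X, y, lam, eta, beta, steps, rng):
    """Assumed update: theta <- theta - eta * grad + sqrt(2 * eta / beta) * N(0, I)."""
    params = {k: v.copy() for k, v in params.items()}  # leave the caller's arrays untouched
    noise_scale = np.sqrt(2.0 * eta / beta)
    for _ in range(steps):
        _, grads = ridge_loss_and_grads(params, X, y, lam)
        for k in params:
            noise = rng.normal(size=params[k].shape)
            params[k] = params[k] - eta * grads[k] + noise_scale * noise
    return params


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = make_teacher_data(n=200, d=5, teacher_width=3, noise_std=0.1, rng=rng)
    student = init_student(d=5, width=20, rng=rng)
    before, _ = ridge_loss_and_grads(student, X, y, lam=1e-4)
    trained = noisy_gd(student, X, y, lam=1e-4, eta=0.05, beta=1e6, steps=300, rng=rng)
    after, _ = ridge_loss_and_grads(trained, X, y, lam=1e-4)
    print("regularized training loss before:", before, "after:", after)
```

In this sketch `beta` plays the role of an inverse temperature: a large value makes the injected noise negligible and the update close to plain gradient descent, while a smaller value adds more exploration of the non-convex landscape. The only regularizer is the ridge penalty; nothing here induces sparsity, which matches the abstract's remark on that point.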
