LG, OC, ML · May 30, 2019

Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks

arXiv:1905.13210v3 · 430 citations
Originality: Incremental advance
AI Analysis

The paper provides theoretical generalization guarantees for deep learning, a fundamental open problem for researchers and practitioners, though it builds incrementally on existing neural tangent kernel work.

The paper addresses generalization in over-parameterized deep neural networks. It shows that the expected 0-1 loss of a wide ReLU network trained with SGD is bounded by the training loss of a neural tangent random feature (NTRF) model. For data distributions that the NTRF model can classify with sufficiently small error, this yields a generalization error bound of order $\tilde{\mathcal{O}}(n^{-1/2})$ that is independent of the network width.

We study the training and generalization of deep neural networks (DNNs) in the over-parameterized regime, where the network width (i.e., the number of hidden nodes per layer) is much larger than the number of training data points. We show that the expected $0$-$1$ loss of a wide enough ReLU network trained with stochastic gradient descent (SGD) and random initialization can be bounded by the training loss of a random feature model induced by the network gradient at initialization, which we call a neural tangent random feature (NTRF) model. For data distributions that can be classified by the NTRF model with sufficiently small error, our result yields a generalization error bound of order $\tilde{\mathcal{O}}(n^{-1/2})$ that is independent of the network width. Our result is more general and sharper than many existing generalization error bounds for over-parameterized neural networks. In addition, we establish a strong connection between our generalization error bound and the neural tangent kernel (NTK) proposed in recent work.
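
To make "a random feature model induced by the network gradient at initialization" concrete, here is a minimal sketch in plain Python. It is not the paper's construction: it assumes a two-layer ReLU network in which only the hidden weights are trainable, and the initialization scale and all function names (`init_weights`, `network`, `grad_features`, `ntrf_predict`) are hypothetical choices made for illustration. The model is the first-order expansion of the network around its random initialization `W0`, so it is linear in the offset `delta` while the gradient features stay fixed.

```python
import math
import random


def init_weights(m, d, seed=0):
    """Random init (an assumed scaling): Gaussian hidden weights, fixed +/-1 output weights."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, math.sqrt(2.0 / m)) for _ in range(d)] for _ in range(m)]
    a = [rng.choice((-1.0, 1.0)) for _ in range(m)]
    return W, a


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def network(W, a, x):
    """Two-layer ReLU network: f_W(x) = sum_j a_j * relu(<w_j, x>)."""
    return sum(aj * max(0.0, dot(wj, x)) for wj, aj in zip(W, a))


def grad_features(W, a, x):
    """Gradient of f_W(x) with respect to W; row j is a_j * 1[<w_j, x> > 0] * x."""
    feats = []
    for wj, aj in zip(W, a):
        active = 1.0 if dot(wj, x) > 0.0 else 0.0
        feats.append([aj * active * xi for xi in x])
    return feats


def ntrf_predict(W0, a, delta, x):
    """Linearized model around W0: f_{W0}(x) + <grad_W f_{W0}(x), delta>."""
    feats = grad_features(W0, a, x)
    inner = sum(dot(frow, drow) for frow, drow in zip(feats, delta))
    return network(W0, a, x) + inner


# Usage: width 64, input dimension 3, one unit-norm input.
W0, a = init_weights(m=64, d=3)
x = [0.6, -0.8, 0.0]
zero = [[0.0] * 3 for _ in range(64)]
print(ntrf_predict(W0, a, zero, x), network(W0, a, x))
```

With a zero offset the linearized model reproduces the network at initialization. Because this toy network is positively homogeneous in `W`, choosing `delta = W0` doubles the output, which is a convenient sanity check on the gradient features.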

