LGJun 28, 2023

Beyond NTK with Vanilla Gradient Descent: A Mean-Field Analysis of Neural Networks with Polynomial Width, Samples, and Time

Stanford
arXiv:2306.16361v219 citationsh-index: 72
Originality Highly original
AI Analysis

This provides theoretical evidence that vanilla gradient descent can surpass kernel methods for neural networks, addressing a fundamental open question in non-convex optimization.

The paper tackles whether unmodified gradient descent on neural networks can outperform kernel methods in sample complexity, proving that projected gradient flow on polynomial-width two-layer networks achieves non-trivial error with O(d^3.1) samples, which kernel methods cannot match with n << d^4 samples.

Despite recent theoretical progress on the non-convex optimization of two-layer neural networks, it is still an open question whether gradient descent on neural networks without unnatural modifications can achieve better sample complexity than kernel methods. This paper provides a clean mean-field analysis of projected gradient flow on polynomial-width two-layer neural networks. Different from prior works, our analysis does not require unnatural modifications of the optimization algorithm. We prove that with sample size $n = O(d^{3.1})$ where $d$ is the dimension of the inputs, the network trained with projected gradient flow converges in $\text{poly}(d)$ time to a non-trivial error that is not achievable by kernel methods using $n \ll d^4$ samples, hence demonstrating a clear separation between unmodified gradient descent and NTK. As a corollary, we show that projected gradient descent with a positive learning rate and a polynomial number of iterations converges to low error with the same sample complexity.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes