Neural Networks Learning and Memorization with (almost) no Over-Parameterization
This addresses the issue of over-parameterization in neural network learning for researchers and practitioners, providing theoretical guarantees for efficient learning with minimal network size, though it is incremental as it builds on prior work on memorization and polynomial learning.
The paper tackles the problem of neural networks requiring excessive parameters for learning non-linear models, showing that SGD on depth two networks can memorize samples and learn polynomials with near-optimal network size, sample complexity, and runtime, specifically memorizing m random points with O(m/d) hidden neurons.
Many results in recent years established polynomial time learnability of various models via neural networks algorithms. However, unless the model is linear separable, or the activation is a polynomial, these results require very large networks -- much more than what is needed for the mere existence of a good predictor. In this paper we prove that SGD on depth two neural networks can memorize samples, learn polynomials with bounded weights, and learn certain kernel spaces, with near optimal network size, sample complexity, and runtime. In particular, we show that SGD on depth two network with $\tilde{O}\left(\frac{m}{d}\right)$ hidden neurons (and hence $\tilde{O}(m)$ parameters) can memorize $m$ random labeled points in $\mathbb{S}^{d-1}$.