Memorizing Gaussians with no over-parameterization via gradient descent on neural networks
This result provides theoretical guarantees for memorization by neural networks without over-parameterization, which matters for understanding optimization and generalization in machine learning.
The paper tackles the problem of memorizing Gaussian data points with neural networks, proving that a single gradient descent step on a two-layer network with orthogonal initialization can memorize Ω(dq/log⁴(d)) independent randomly labeled Gaussians in ℝ^d.
We prove that a single step of gradient descent over a depth-two network with $q$ hidden neurons, starting from orthogonal initialization, can memorize $\Omega\left(\frac{dq}{\log^4(d)}\right)$ independent and randomly labeled Gaussians in $\mathbb{R}^d$. The result is valid for a large class of activation functions, which includes the absolute value.
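To make the setting concrete, the following is a minimal NumPy sketch of one gradient step on a depth-two network with absolute-value activation and orthogonally initialized hidden weights, applied to randomly labeled Gaussian inputs. The abstract does not specify the loss, the output-layer weights, the step size, or the problem sizes, so the logistic loss, the fixed alternating $\pm 1/\sqrt{q}$ output weights, the learning rate, and the default values of `d`, `q`, and `m` below are all illustrative assumptions, as are the function names. The sketch only sets up the experiment; it does not reproduce the proof or demonstrate the stated bound.

```python
import numpy as np


def orthogonal_init(q, d, rng):
    """Return a (q, d) matrix with orthonormal rows (this sketch assumes q <= d)."""
    if q > d:
        raise ValueError("this sketch assumes q <= d")
    gauss = rng.standard_normal((d, q))
    basis, _ = np.linalg.qr(gauss)  # basis has shape (d, q) with orthonormal columns
    return basis.T


def network(W, u, X):
    """Depth-two network f(x) = sum_j u_j * |<w_j, x>|, evaluated on the rows of X."""
    return np.abs(X @ W.T) @ u


def gradient_step(W, u, X, y, lr):
    """One full-batch gradient step on W for the mean logistic loss (an assumed loss)."""
    m = X.shape[0]
    pre = X @ W.T                              # (m, q) pre-activations
    margins = y * (np.abs(pre) @ u)            # (m,) values of y_i * f(x_i)
    dloss_df = -y / (1.0 + np.exp(margins))    # d/df of log(1 + exp(-y f))
    # Chain rule: d|z|/dz = sign(z), scaled by the fixed output weight u_j.
    grad_W = ((np.sign(pre) * dloss_df[:, None]).T @ X) * u[:, None] / m
    return W - lr * grad_W


def train_accuracy(W, u, X, y):
    """Fraction of samples whose label matches the sign of the network output."""
    return float(np.mean(np.sign(network(W, u, X)) == y))


def demo(d=64, q=32, m=128, lr=1.0, seed=0):
    """Return training accuracy before and after a single gradient step."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((m, d))            # independent Gaussians in R^d
    y = rng.choice([-1.0, 1.0], size=m)        # uniformly random labels
    W0 = orthogonal_init(q, d, rng)
    # Assumed output layer: fixed alternating signs, scaled by 1/sqrt(q).
    u = np.where(np.arange(q) % 2 == 0, 1.0, -1.0) / np.sqrt(q)
    W1 = gradient_step(W0, u, X, y, lr)
    return train_accuracy(W0, u, X, y), train_accuracy(W1, u, X, y)
```

Calling `demo()` returns the training accuracy at initialization and after the single step. Whether the step fits the labels depends heavily on the assumed loss, step size, and scaling, so no particular outcome is claimed here.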