LGMLMay 26, 2019

On Learning Over-parameterized Neural Networks: A Functional Approximation Perspective

arXiv:1905.10826v261 citations
Originality Incremental advance
AI Analysis

This provides theoretical insights into the generalization of over-parameterized neural networks, addressing a foundational problem in machine learning, though it is incremental as it builds on prior work in this area.

The paper tackles the problem of understanding why over-parameterized neural networks trained with gradient descent generalize well, showing that when the network is sufficiently wide, the empirical risk decreases to a low-rank approximation error at a linear rate independent of sample size, and if the target function has zero approximation error, the risk reduces to Θ(1/√n) with nearly-linear over-parameterization.

We consider training over-parameterized two-layer neural networks with Rectified Linear Unit (ReLU) using gradient descent (GD) method. Inspired by a recent line of work, we study the evolutions of network prediction errors across GD iterations, which can be neatly described in a matrix form. When the network is sufficiently over-parameterized, these matrices individually approximate {\em an} integral operator which is determined by the feature vector distribution $ρ$ only. Consequently, GD method can be viewed as {\em approximately} applying the powers of this integral operator on the underlying/target function $f^*$ that generates the responses/labels. We show that if $f^*$ admits a low-rank approximation with respect to the eigenspaces of this integral operator, then the empirical risk decreases to this low-rank approximation error at a linear rate which is determined by $f^*$ and $ρ$ only, i.e., the rate is independent of the sample size $n$. Furthermore, if $f^*$ has zero low-rank approximation error, then, as long as the width of the neural network is $Ω(n\log n)$, the empirical risk decreases to $Θ(1/\sqrt{n})$. To the best of our knowledge, this is the first result showing the sufficiency of nearly-linear network over-parameterization. We provide an application of our general results to the setting where $ρ$ is the uniform distribution on the spheres and $f^*$ is a polynomial. Throughout this paper, we consider the scenario where the input dimension $d$ is fixed.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes