LG NA ML | Mar 30, 2022

Optimal Learning

Peter Binev, Andrea Bonito, Ronald DeVore, Guergana Petrova

arXiv:2203.15994v222.128 citations

Originality: Incremental advance

AI Analysis

This paper provides a theoretical justification for over-parameterization in machine learning, addressing a foundational issue for researchers in optimization and learning theory.

The paper tackles the problem of learning an unknown function from data by showing that over-parameterized optimization with a penalty term yields a near-optimal approximation, with error bounded by a constant times the optimal error, and provides quantitative bounds on over-parameterization and penalization.
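The phrase "a constant times the optimal error" can be made precise in the standard optimal-recovery setting. The notation below (model class $K$, norm $\|\cdot\|_X$, data functionals $\lambda_j$, observed values $w_j$, constant $C$) is assumed here for illustration and is not taken from this page. Writing $K_w$ for the functions in the model class that match the data, the optimal error is the Chebyshev radius of $K_w$, and near-optimality asks for a bound within a fixed factor of it:

$$
K_w := \{\, g \in K : \lambda_j(g) = w_j,\ j = 1,\dots,m \,\}, \qquad
R(K_w)_X := \inf_{g \in X} \, \sup_{h \in K_w} \|h - g\|_X ,
$$

$$
\|f - \hat f\|_X \le C \, R(K_w)_X \quad \text{for every } f \in K_w .
$$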

This paper studies the problem of learning an unknown function $f$ from given data about $f$. The learning problem is to give an approximation $\hat f$ to $f$ that predicts the values of $f$ away from the data. There are numerous settings for this learning problem depending on (i) what additional information we have about $f$ (known as a model class assumption), (ii) how we measure the accuracy with which $\hat f$ predicts $f$, (iii) what is known about the data and data sites, and (iv) whether the data observations are polluted by noise. A mathematical description of the optimal performance possible (the smallest possible error of recovery) is known in the presence of a model class assumption. Under standard model class assumptions, it is shown in this paper that a near-optimal $\hat f$ can be found by solving a certain discrete over-parameterized optimization problem with a penalty term. Here, near-optimal means that the error is bounded by a fixed constant times the optimal error. This explains the advantage of over-parameterization, which is commonly used in modern machine learning. The main results of this paper prove that over-parameterized learning with an appropriate loss function gives a near-optimal approximation $\hat f$ of the function $f$ from which the data is collected. Quantitative bounds are given for how much over-parameterization needs to be employed and how the penalization needs to be scaled in order to guarantee a near-optimal recovery of $f$. An extension of these results to the case where the data is polluted by additive deterministic noise is also given.
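To make the idea of "over-parameterized optimization with a penalty term" concrete, here is a minimal sketch. It is not the paper's algorithm: the cosine feature map, the weighted squared-$\ell_2$ penalty, the test function `target`, and all parameter values (`n_params`, `lam`, `s`) are assumptions chosen for illustration. It fits 8 data points with 64 parameters by minimizing a data-fit term plus a penalty that discourages high-frequency coefficients.

```python
import numpy as np


def features(x, n_params):
    """Cosine features cos(k*pi*x), k = 0..n_params-1 (an illustrative choice)."""
    k = np.arange(n_params)
    return np.cos(np.pi * np.outer(x, k))


def fit_overparameterized(x, w, n_params, lam, s=1.0):
    """Minimize ||A c - w||^2 + lam * sum_k (1 + k)^(2s) * c_k^2 over c.

    With n_params much larger than len(x) the data-fit term alone has
    infinitely many minimizers; the penalty selects a smooth one.
    """
    A = features(x, n_params)                       # shape (m, n_params)
    d2 = (1.0 + np.arange(n_params)) ** (2.0 * s)   # penalty weights
    # Normal equations of the penalized least-squares problem.
    G = A.T @ A + lam * np.diag(d2)
    return np.linalg.solve(G, A.T @ w)


def predict(c, x):
    """Evaluate the fitted function at the points x."""
    return features(x, c.size) @ c


def target(x):
    """Hypothetical 'unknown' function used only to generate data."""
    return np.cos(np.pi * x) + 0.5 * np.cos(3 * np.pi * x)


if __name__ == "__main__":
    x_data = np.linspace(0.05, 0.95, 8)   # m = 8 data sites
    w_data = target(x_data)               # noiseless observations
    c = fit_overparameterized(x_data, w_data, n_params=64, lam=1e-6)

    x_grid = np.linspace(0.0, 1.0, 201)
    data_residual = np.max(np.abs(predict(c, x_data) - w_data))
    grid_error = np.max(np.abs(predict(c, x_grid) - target(x_grid)))
    print("max residual at data sites:", data_residual)
    print("max error on grid:", grid_error)
```

In this sketch, a small `lam` makes the fit nearly interpolate the data while the weights `(1 + k)^(2s)` play the role of a smoothness assumption on $f$. How many parameters and what penalty scaling are needed for a guarantee is what the paper's quantitative bounds address; the values here are arbitrary.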
