STLGMLFeb 17, 2022

Universality of empirical risk minimization

arXiv:2202.08832v296 citations
AI Analysis

This provides foundational insights for theoretical machine learning by extending universality results to broader settings, including random features and neural tangent models, which is incremental but generalizes prior work.

The paper tackles the problem of understanding the universality of empirical risk minimization in supervised learning under proportional asymptotics, proving that both training and test errors depend only on the covariance structure of feature distributions, and can be approximated using Gaussian models without strong convexity or independence assumptions.

Consider supervised learning from i.i.d. samples $\{{\boldsymbol x}_i,y_i\}_{i\le n}$ where ${\boldsymbol x}_i \in\mathbb{R}^p$ are feature vectors and ${y} \in \mathbb{R}$ are labels. We study empirical risk minimization over a class of functions that are parameterized by $\mathsf{k} = O(1)$ vectors ${\boldsymbol θ}_1, . . . , {\boldsymbol θ}_{\mathsf k} \in \mathbb{R}^p$ , and prove universality results both for the training and test error. Namely, under the proportional asymptotics $n,p\to\infty$, with $n/p = Θ(1)$, we prove that the training error depends on the random features distribution only through its covariance structure. Further, we prove that the minimum test error over near-empirical risk minimizers enjoys similar universality properties. In particular, the asymptotics of these quantities can be computed $-$to leading order$-$ under a simpler model in which the feature vectors ${\boldsymbol x}_i$ are replaced by Gaussian vectors ${\boldsymbol g}_i$ with the same covariance. Earlier universality results were limited to strongly convex learning procedures, or to feature vectors ${\boldsymbol x}_i$ with independent entries. Our results do not make any of these assumptions. Our assumptions are general enough to include feature vectors ${\boldsymbol x}_i$ that are produced by randomized featurization maps. In particular we explicitly check the assumptions for certain random features models (computing the output of a one-layer neural network with random weights) and neural tangent models (first-order Taylor approximation of two-layer networks).

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes