LG · AI · ML · Feb 6, 2024

Scaling laws for learning with real and surrogate data

arXiv:2402.04376v3 · 26 citations · h-index: 19 · NIPS
Originality: Incremental advance
AI Analysis

This paper addresses a bottleneck for machine learning practitioners: collecting high-quality data is often expensive or impractical. It offers a method for leveraging more accessible surrogate data, though the advance is incremental because it builds on classical statistical models.

The paper tackles the problem of data scarcity in machine learning by proposing a weighted empirical risk minimization method to integrate surrogate data, showing that it can significantly reduce test error on the original distribution, with scaling laws predicting optimal weighting and data amounts.

Collecting large quantities of high-quality data can be prohibitively expensive or impractical, and is a bottleneck in machine learning. One may instead augment a small set of $n$ data points from the target distribution with data from more accessible sources, e.g. data collected under different circumstances or synthesized by generative models. We refer to such data as 'surrogate data'. We study a weighted empirical risk minimization (ERM) approach for integrating surrogate data into training. We analyze this method mathematically under several classical statistical models, and validate our findings empirically on datasets from different domains. Our main findings are: $(i)$ Integrating surrogate data can significantly reduce the test error on the original distribution. Surprisingly, this can happen even when the surrogate data is unrelated to the original data. We trace this behavior back to the classical Stein's paradox. $(ii)$ In order to reap the benefit of surrogate data, it is crucial to use optimally weighted ERM. $(iii)$ The test error of models trained on mixtures of real and surrogate data is approximately described by a scaling law. This scaling law can be used to predict the optimal weighting scheme, and to choose the amount of surrogate data to add.
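The weighted ERM idea in the abstract can be illustrated with a minimal sketch. It assumes squared loss with a small ridge penalty and a single weight `alpha` in [0, 1] that trades off the average loss on real data against the average loss on surrogate data. The function names, the synthetic Gaussian data, and the choice of `alpha` by grid search on held-out real data are all illustrative assumptions, not the paper's code; in particular, the paper proposes predicting the optimal weight from a scaling law, which this sketch does not implement.

```python
import numpy as np


def weighted_ridge(X_real, y_real, X_sur, y_sur, alpha, lam=1e-3):
    """Closed-form minimizer of
    (1 - alpha) * mean sq. loss on real + alpha * mean sq. loss on surrogate
    + lam * ||w||^2  (an assumed objective, for illustration only)."""
    n, d = X_real.shape
    m = X_sur.shape[0]
    A = ((1 - alpha) / n) * X_real.T @ X_real \
        + (alpha / m) * X_sur.T @ X_sur \
        + lam * np.eye(d)
    b = ((1 - alpha) / n) * X_real.T @ y_real \
        + (alpha / m) * X_sur.T @ y_sur
    return np.linalg.solve(A, b)


def make_data(rng, w, n, noise=0.5):
    """Draw n samples from a linear model with Gaussian features and noise."""
    X = rng.standard_normal((n, w.size))
    y = X @ w + noise * rng.standard_normal(n)
    return X, y


def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))


rng = np.random.default_rng(0)
d = 20
w_real = rng.standard_normal(d)
# Hypothetical surrogate distribution: same features, shifted coefficients.
w_sur = w_real + 0.3 * rng.standard_normal(d)

X_real, y_real = make_data(rng, w_real, n=30)    # scarce real data
X_sur, y_sur = make_data(rng, w_sur, n=500)      # plentiful surrogate data
X_val, y_val = make_data(rng, w_real, n=200)     # held-out real data

# Grid search over the weight; alpha = 0 ignores surrogate data,
# alpha = 1 ignores real data.
alphas = np.linspace(0.0, 1.0, 21)
val_errors = [
    mse(weighted_ridge(X_real, y_real, X_sur, y_sur, a), X_val, y_val)
    for a in alphas
]
best_alpha = float(alphas[int(np.argmin(val_errors))])
w_hat = weighted_ridge(X_real, y_real, X_sur, y_sur, best_alpha)

print(f"best alpha: {best_alpha:.2f}")
print(f"validation MSE at alpha=0: {val_errors[0]:.3f}")
print(f"validation MSE at best alpha: {min(val_errors):.3f}")
```

Comparing the error at `alpha = 0` with the error at `best_alpha` shows whether, on this synthetic problem, mixing in surrogate data helps relative to training on real data alone; the outcome depends on the assumed shift and sample sizes.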

