LG ST ML Feb 18, 2022

Gaussian and Non-Gaussian Universality of Data Augmentation

arXiv:2202.09134v5
Originality: Incremental advance
AI Analysis

This paper gives machine learning practitioners theoretical insight into data augmentation, but the advance is incremental because it builds on existing universality techniques.

The paper studies how data augmentation affects the uncertainty and limiting distribution of estimates in machine learning. It finds that augmentation can increase uncertainty, acts as a regularizer in some settings but not in others, and can shift double-descent peaks. Which of these occurs depends on factors such as the data distribution and the properties of the estimator.

We provide universality results that quantify how data augmentation affects the variance and limiting distribution of estimates through simple surrogates, and analyze several specific models in detail. The results confirm some observations made in machine learning practice, but also lead to unexpected findings: Data augmentation may increase rather than decrease the uncertainty of estimates, such as the empirical prediction risk. It can act as a regularizer, but fails to do so in certain high-dimensional problems, and it may shift the double-descent peak of an empirical risk. Overall, the analysis shows that several properties data augmentation has been attributed with are not either true or false, but rather depend on a combination of factors -- notably the data distribution, the properties of the estimator, and the interplay of sample size, number of augmentations, and dimension. As our main theoretical tool, we develop an adaptation of Lindeberg's technique for block dependence. The resulting universality regime may be Gaussian or non-Gaussian.
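To make the claim that augmentation can increase the variance of an estimate concrete, here is a minimal toy sketch. It is not taken from the paper. It assumes scalar data X_i ~ N(0, 1), an augmentation that adds independent N(0, sigma^2) noise, and an estimator that is the plain average over all n * k augmented points. The k copies of one data point are dependent and distinct data points are independent, so the variance of the average is (Var(phi X) / k + (1 - 1/k) * Cov(phi_1 X, phi_2 X)) / n. The function names `augmented_mean_variance` and `simulate_augmented_mean` are hypothetical and chosen for illustration only.

```python
import random
import statistics


def augmented_mean_variance(var_aug, cov_aug, n, k):
    """Variance of the average over n independent blocks of k augmented copies.

    var_aug: variance of a single augmented point, Var(phi X)
    cov_aug: covariance of two augmentations of the same point,
             Cov(phi_1 X, phi_2 X)
    """
    return (var_aug / k + (1.0 - 1.0 / k) * cov_aug) / n


def simulate_augmented_mean(n, k, sigma, rng):
    """Draw n points from N(0, 1), add noise k times each, average everything."""
    total = 0.0
    for _point in range(n):
        x = rng.gauss(0.0, 1.0)
        for _copy in range(k):
            # Toy augmentation (an assumption): additive Gaussian noise.
            total += x + rng.gauss(0.0, sigma)
    return total / (n * k)


rng = random.Random(0)
n, k, sigma, reps = 20, 4, 2.0, 4000

# Without augmentation the sample mean of n points has variance 1 / n = 0.05.
theory_plain = 1.0 / n

# With noise augmentation: Var(phi X) = 1 + sigma^2 and Cov(phi_1 X, phi_2 X) = 1,
# so the variance is (1 + sigma^2 / k) / n = 0.1 here.
theory_aug = augmented_mean_variance(1.0 + sigma**2, 1.0, n, k)

# Monte Carlo estimate of the same variance, for comparison with the formula.
empirical_aug = statistics.pvariance(
    [simulate_augmented_mean(n, k, sigma, rng) for _ in range(reps)]
)

print(f"plain mean variance (theory):     {theory_plain:.4f}")
print(f"augmented mean variance (theory): {theory_aug:.4f}")
print(f"augmented mean variance (MC):     {empirical_aug:.4f}")
```

In this toy setting the augmented average always has larger variance than the plain sample mean, because the added noise is not a symmetry of the data distribution. If the augmentation instead left the distribution unchanged, Var(phi X) would equal Var(X) and the covariance term could not exceed it, so the same formula would give a variance no larger than 1 / n.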

Code Implementations: 1 repo
