STAT-MECH DIS-NN LG NC ML | Sep 18, 2025

Data coarse graining can improve model performance

Alex Nguyen, David J. Schwab, Vudtiwat Ngampruetikorn

arXiv:2509.14498v1 | 1 citation | h-index: 10

Originality: Incremental advance

AI Analysis

The paper provides a principled explanation for why careful data augmentation works, which is useful to researchers in machine learning theory and practice. The advance is incremental because it builds on existing ideas from statistical physics.

The paper tackles the paradox of how lossy data transformations can improve generalization in machine learning by analyzing a solvable model of high-dimensional, ridge-regularized linear regression under data coarse graining. It finds that prediction risk depends nonmonotonically on the degree of coarse graining: a 'high-pass' scheme that filters out less relevant features can reduce risk. The effect persists under optimal regularization, which shows that it is distinct from double descent.

Lossy data transformations by definition lose information. Yet, in modern machine learning, methods like data pruning and lossy data augmentation can help improve generalization performance. We study this paradox using a solvable model of high-dimensional, ridge-regularized linear regression under 'data coarse graining.' Inspired by the renormalization group in statistical physics, we analyze coarse-graining schemes that systematically discard features based on their relevance to the learning task. Our results reveal a nonmonotonic dependence of the prediction risk on the degree of coarse graining. A 'high-pass' scheme--which filters out less relevant, lower-signal features--can help models generalize better. By contrast, a 'low-pass' scheme that integrates out more relevant, higher-signal features is purely detrimental. Crucially, using optimal regularization, we demonstrate that this nonmonotonicity is a distinct effect of data coarse graining and not an artifact of double descent. Our framework offers a clear, analytical explanation for why careful data augmentation works: it strips away less relevant degrees of freedom and isolates more predictive signals. Our results highlight a complex, nonmonotonic risk landscape shaped by the structure of the data, and illustrate how ideas from statistical physics provide a principled lens for understanding modern machine learning phenomena.
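The setup described in the abstract can be explored numerically with a toy experiment. The sketch below is not the paper's model or code. It assumes independent Gaussian features whose standard deviation decays as 1/k, equal true weights on every feature, and a fixed ridge penalty rather than the optimal one. The function names (`ridge_test_risk`, `coarse_graining_risks`) and all parameter values are hypothetical. 'High-pass' here keeps the k strongest features and 'low-pass' keeps the k weakest, following the abstract's usage of those terms.

```python
import numpy as np


def ridge_test_risk(X_train, y_train, X_test, y_test, lam):
    """Fit ridge regression in closed form and return the mean squared test error.

    Minimizes (1/n) * ||y - X w||^2 + lam * ||w||^2.
    """
    n, p = X_train.shape
    gram = X_train.T @ X_train + lam * n * np.eye(p)
    w_hat = np.linalg.solve(gram, X_train.T @ y_train)
    return float(np.mean((X_test @ w_hat - y_test) ** 2))


def coarse_graining_risks(keep_counts, n_features=200, n_train=100,
                          n_test=2000, noise_std=0.5, lam=0.01, seed=0):
    """Compare test risk when keeping the k strongest or k weakest features."""
    rng = np.random.default_rng(seed)
    # Assumption: feature k has standard deviation 1/k, so features are
    # ordered from highest to lowest signal.
    scales = 1.0 / np.arange(1, n_features + 1)
    # Assumption: every feature has the same true weight.
    w_true = np.ones(n_features)

    def sample(n):
        X = rng.normal(size=(n, n_features)) * scales
        y = X @ w_true + noise_std * rng.normal(size=n)
        return X, y

    X_train, y_train = sample(n_train)
    X_test, y_test = sample(n_test)

    risks = {}
    for k in keep_counts:
        kept = {
            "high_pass": slice(0, k),                        # drop low-signal features
            "low_pass": slice(n_features - k, n_features),   # drop high-signal features
        }
        # The target still depends on all features; the model only sees the kept ones.
        risks[k] = {
            name: ridge_test_risk(X_train[:, cols], y_train,
                                  X_test[:, cols], y_test, lam)
            for name, cols in kept.items()
        }
    return risks


if __name__ == "__main__":
    for k, r in coarse_graining_risks([20, 50, 100, 200]).items():
        print(f"keep {k:3d}: high-pass {r['high_pass']:.3f}, low-pass {r['low_pass']:.3f}")
```

Sweeping `keep_counts` and `lam` gives a rough picture of how risk changes with the degree of coarse graining under these assumptions. Whether the curve is nonmonotonic depends on the chosen spectrum, noise level, and regularization, so this sketch should not be read as reproducing the paper's analytical results.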
