Neural Scaling Laws Rooted in the Data Distribution
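This work provides a theoretical framework for understanding neural scaling laws, grounding empirically observed scaling behavior in properties of the data distribution. Because such scaling laws appear across architectures, tasks, and datasets, the framework may be relevant to much of machine learning.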
The paper addresses the problem of explaining the seemingly universal neural scaling laws observed in deep learning. It develops a mathematical model of natural datasets based on percolation theory and identifies two criticality regimes, each yielding optimal power-law scaling. The model unifies prior theories of neural scaling and is tested on toy datasets.
Deep neural networks exhibit empirical neural scaling laws, with error decreasing as a power law with increasing model or data size, across a wide variety of architectures, tasks, and datasets. This universality suggests that scaling laws may result from general properties of natural learning tasks. We develop a mathematical model intended to describe natural datasets using percolation theory. Two distinct criticality regimes emerge, each yielding optimal power-law neural scaling laws. These regimes, corresponding to power-law-distributed discrete subtasks and a dominant data manifold, can be associated with previously proposed theories of neural scaling, thereby grounding and unifying prior works. We test the theory by training regression models on toy datasets derived from percolation theory simulations. We suggest directions for quantitatively predicting language model scaling.
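As a concrete illustration of what "error decreasing as a power law" means, the following minimal sketch estimates a scaling exponent from (size, error) pairs by linear regression in log-log space. It is not the paper's method. The function name `fit_power_law`, the synthetic data, and the chosen exponent and prefactor are assumptions made purely for illustration, and the functional form error ≈ c · N^(-alpha) ignores any irreducible error floor.

```python
import numpy as np


def fit_power_law(sizes, errors):
    """Fit errors ~ c * sizes**(-alpha) by least squares in log-log space.

    Returns (alpha, c). Hypothetical helper, not taken from the paper.
    """
    log_n = np.log(np.asarray(sizes, dtype=float))
    log_e = np.log(np.asarray(errors, dtype=float))
    # A power law is a straight line in log-log coordinates:
    # log(error) = log(c) - alpha * log(N)
    slope, intercept = np.polyfit(log_n, log_e, 1)
    return -slope, np.exp(intercept)


# Synthetic data for illustration only (assumed exponent and prefactor).
rng = np.random.default_rng(0)
true_alpha, true_c = 0.5, 3.0
sizes = np.logspace(2, 6, 20)  # e.g. dataset or model sizes
noise = np.exp(rng.normal(0.0, 0.05, sizes.size))  # multiplicative noise
errors = true_c * sizes ** (-true_alpha) * noise

alpha_hat, c_hat = fit_power_law(sizes, errors)
print(f"estimated exponent: {alpha_hat:.3f}, prefactor: {c_hat:.3f}")
```

On real training curves, the fit is usually restricted to the range where the log-log plot is visibly linear, since small-size transients and an error floor both bend the curve away from a pure power law.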