LG · ML · Apr 22, 2020

A Neural Scaling Law from the Dimension of the Data Manifold

arXiv:2004.10802v1 · 66 citations
Originality Highly original
AI Analysis

This work offers a theoretical explanation for neural scaling laws, which matters for understanding and optimizing large-scale machine learning models in domains such as vision and language.

The paper explains the power-law scaling of neural network loss with model size by proposing that it arises from regression on a data manifold of intrinsic dimension d, which predicts scaling exponents α ≈ 4/d. The authors test this theory in a teacher/student framework, with CNNs on image datasets, and with GPT-type language models, and report that the measured intrinsic dimensions align with the measured scaling exponents.
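The comparison described above, a measured scaling exponent against the predicted 4/d, can be sketched as a log-log least-squares fit. This is a minimal illustration and not the authors' procedure: the function names `fit_power_law_exponent` and `predicted_exponent`, the synthetic losses, and the choice d = 8 are all assumptions made for the example.

```python
import math


def fit_power_law_exponent(n_params, losses):
    """Estimate alpha in L ∝ N^(-alpha) by ordinary least squares on (log N, log L)."""
    xs = [math.log(n) for n in n_params]
    ys = [math.log(loss) for loss in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    # The slope of log L against log N is -alpha.
    return -sxy / sxx


def predicted_exponent(intrinsic_dim):
    """Exponent predicted by the manifold-regression picture: alpha ≈ 4/d."""
    return 4.0 / intrinsic_dim


# Hypothetical, noise-free data generated from the predicted law itself (d = 8),
# so the fit should simply recover 4/d. Real measurements would be noisy.
d = 8
n_params = [10**3, 10**4, 10**5, 10**6]
losses = [3.0 * n ** (-predicted_exponent(d)) for n in n_params]

alpha_hat = fit_power_law_exponent(n_params, losses)
print(f"fitted alpha = {alpha_hat:.3f}, predicted 4/d = {predicted_exponent(d):.3f}")
```

With real measurements, the intrinsic dimension d would be estimated separately and the two numbers compared. The fit also only makes sense in the regime the abstract describes, where data is plentiful and the networks are well trained.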

When data is plentiful, the loss achieved by well-trained neural networks scales as a power-law $L \propto N^{-\alpha}$ in the number of network parameters $N$. This empirical scaling law holds for a wide variety of data modalities, and may persist over many orders of magnitude. The scaling law can be explained if neural models are effectively just performing regression on a data manifold of intrinsic dimension $d$. This simple theory predicts that the scaling exponents $\alpha \approx 4/d$ for cross-entropy and mean-squared error losses. We confirm the theory by independently measuring the intrinsic dimension and the scaling exponents in a teacher/student framework, where we can study a variety of $d$ and $\alpha$ by dialing the properties of random teacher networks. We also test the theory with CNN image classifiers on several datasets and with GPT-type language models.
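A heuristic sketch of where the exponent $4/d$ can come from. It assumes two things not spelled out in the abstract: that the network acts like a piecewise-linear approximator dividing the $d$-dimensional manifold into roughly $N$ cells of linear size $s$, and that the loss is approximately quadratic in the approximation error $|f - \hat{f}|$ near its minimum.

$$
s \propto N^{-1/d}, \qquad |f - \hat{f}| \propto s^{2}, \qquad L \propto |f - \hat{f}|^{2} \propto s^{4} \propto N^{-4/d} \quad\Longrightarrow\quad \alpha \approx \frac{4}{d}.
$$

Under these assumptions, an intrinsic dimension of $d = 8$ would give $\alpha \approx 0.5$, and a larger $d$ would give a smaller exponent, meaning slower improvement with model size.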

