LG NE OC ML · May 2, 2021

Universal scaling laws in the gradient descent training of neural networks

arXiv:2105.00507v18.410 citations

Originality: Incremental advance

AI Analysis

This work provides a universal scaling law for optimization trajectories in neural networks. The advance is incremental, but it extends theoretical insight beyond rigorous yet potentially loose bounds on the loss.

The authors derived an explicit asymptotic power law for the loss decay in gradient descent training of neural networks, showing that the exponent depends only on data dimension, activation smoothness, and function class, without requiring specific data distributions.

Current theoretical results on optimization trajectories of neural networks trained by gradient descent typically have the form of rigorous but potentially loose bounds on the loss values. In the present work we take a different approach and show that the learning trajectory can be characterized by an explicit asymptotic at large training times. Specifically, the leading term in the asymptotic expansion of the loss behaves as a power law $L(t) \sim t^{-\xi}$ with exponent $\xi$ expressed only through the data dimension, the smoothness of the activation function, and the class of functions being approximated. Our results are based on spectral analysis of the integral operator representing the linearized evolution of a large network trained on the expected loss. Importantly, the techniques we employ do not require a specific form of the data distribution, for example Gaussian, thus making our findings sufficiently universal.
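To illustrate how a spectral picture of linearized training can produce a power law, here is a minimal numerical sketch. It is not the authors' derivation. It assumes a hypothetical linear model whose operator eigenvalues decay as $\lambda_k = k^{-\nu}$ and whose squared target coefficients decay as $c_k^2 = k^{-(\kappa+1)}$; the symbols $\nu$ and $\kappa$, the function names, and the chosen values are illustrative assumptions and do not appear in the abstract. Under gradient flow each mode decays independently, so $L(t) = \tfrac12 \sum_k c_k^2 e^{-2\lambda_k t}$, and replacing the sum by an integral gives $L(t) \sim t^{-\kappa/\nu}$ in this toy setting.

```python
import numpy as np


def linearized_loss(t, nu=2.0, kappa=1.0, n_modes=200_000):
    """Loss of a linear model under gradient flow at time t.

    Assumed (hypothetical) spectrum: eigenvalues k**(-nu) and squared
    target coefficients k**(-(kappa + 1)), truncated at n_modes.
    """
    k = np.arange(1, n_modes + 1, dtype=np.float64)
    eigenvalues = k ** (-nu)
    coeff_sq = k ** (-(kappa + 1.0))
    # Each eigenmode of the quadratic loss decays as exp(-2 * lambda_k * t).
    return 0.5 * np.sum(coeff_sq * np.exp(-2.0 * eigenvalues * t))


def fitted_exponent(t_lo=1e3, t_hi=1e5, num=20, **spectrum):
    """Estimate xi in L(t) ~ t**(-xi) from a log-log least-squares fit."""
    ts = np.logspace(np.log10(t_lo), np.log10(t_hi), num)
    losses = np.array([linearized_loss(t, **spectrum) for t in ts])
    slope, _intercept = np.polyfit(np.log(ts), np.log(losses), 1)
    return -slope


if __name__ == "__main__":
    # With nu = 2 and kappa = 1 the fit should be close to kappa / nu = 0.5.
    print(fitted_exponent(nu=2.0, kappa=1.0))
```

The fit window is kept well below the time scale set by the smallest retained eigenvalue, since a truncated spectrum eventually decays exponentially rather than as a power law.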
