LG · Oct 29, 2021

Training Integrable Parameterizations of Deep Neural Networks in the Infinite-Width Limit

arXiv:2110.15596v2 · 7 citations
AI Analysis

This work contributes to the theoretical understanding of deep learning dynamics. It is incremental in that it builds on existing large-width asymptotics and focuses on one specific family of parameterizations.

The paper analyzes gradient dynamics in deep neural networks by studying integrable parameterizations (IPs) in the infinite-width limit. It shows that, under standard i.i.d. zero-mean initialization, IP networks with more than four layers start at a stationary point, so no learning occurs. The authors propose several ways to avoid this, including large initial learning rates, and confirm the results with numerical experiments on image classification tasks. These experiments also reveal strong differences between activation functions.

To theoretically understand the behavior of trained deep neural networks, it is necessary to study the dynamics induced by gradient methods from a random initialization. However, the nonlinear and compositional structure of these models makes these dynamics difficult to analyze. To overcome these challenges, large-width asymptotics have recently emerged as a fruitful viewpoint and led to practical insights on real-world deep networks. For two-layer neural networks, it has been understood via these asymptotics that the nature of the trained model radically changes depending on the scale of the initial random weights, ranging from a kernel regime (for large initial variance) to a feature learning regime (for small initial variance). For deeper networks more regimes are possible, and in this paper we study in detail a specific choice of "small" initialization corresponding to "mean-field" limits of neural networks, which we call integrable parameterizations (IPs). First, we show that under standard i.i.d. zero-mean initialization, integrable parameterizations of neural networks with more than four layers start at a stationary point in the infinite-width limit and no learning occurs. We then propose various methods to avoid this trivial behavior and analyze in detail the resulting dynamics. In particular, one of these methods consists in using large initial learning rates, and we show that it is equivalent to a modification of the recently proposed maximal update parameterization $\mu$P. We confirm our results with numerical experiments on image classification tasks, which additionally show a strong difference in behavior between various choices of activation functions that is not yet captured by theory.
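
As a rough illustration of why a "small" mean-field-style initialization can leave a deep network close to a trivial state at initialization, the NumPy sketch below tracks the size of the pre-activations layer by layer. It is a sketch under stated assumptions, not the paper's implementation: it assumes i.i.d. standard Gaussian weights, no width scaling on the first layer, a 1/width factor on every later layer, and a ReLU activation. The function name `ip_forward_scales` and all parameter names are hypothetical. The abstract's statement concerns stationarity of the training dynamics in the infinite-width limit; this sketch only shows the related forward-pass shrinkage at finite width.

```python
import numpy as np


def ip_forward_scales(width, n_hidden, d_in=10, seed=0):
    """Return the RMS size of the pre-activations of each hidden layer at init.

    Assumed scaling (an illustration, not taken from the source):
      layer 1:   h = W x              with W_ij ~ N(0, 1)
      layer l>1: h = (1/width) W s(h) with W_ij ~ N(0, 1), s = ReLU
    """
    rng = np.random.default_rng(seed)

    # Fixed unit-norm input so that first-layer pre-activations are O(1).
    x = rng.standard_normal(d_in)
    x /= np.linalg.norm(x)

    # First layer: no width-dependent scaling.
    h = rng.standard_normal((width, d_in)) @ x
    scales = [float(np.sqrt(np.mean(h ** 2)))]

    # Later layers: mean-field style 1/width factor in front of the weights.
    for _ in range(n_hidden - 1):
        w = rng.standard_normal((width, width))
        h = (w @ np.maximum(h, 0.0)) / width
        scales.append(float(np.sqrt(np.mean(h ** 2))))

    return scales


if __name__ == "__main__":
    for m in (128, 512, 2048):
        rms = ip_forward_scales(width=m, n_hidden=4)
        print(m, ["%.2e" % s for s in rms])
```

Under these assumptions, each layer after the first sums `width` zero-mean terms and divides by `width`, so its typical size drops by a factor of order `width ** -0.5` relative to the previous layer. The deeper layers therefore shrink quickly as the width grows, which is the kind of degeneracy that the remedies mentioned in the abstract (such as large initial learning rates) are meant to counteract.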

Code Implementations: 1 repo
