LG · ML · May 13, 2023

Depth Dependence of $μ$P Learning Rates in ReLU MLPs

arXiv:2305.07810v1 · 10 citations
Originality: Incremental advance
AI Analysis

This paper provides theoretical insight into the training dynamics of deep neural networks, but it is incremental because it builds on prior work on μP.

The paper investigates how the maximal update ($μ$P) learning rate in ReLU multilayer perceptrons depends on network depth, finding that it scales as $L^{-3/2}$ while being independent of width for all but the first and last layer weights.

In this short note we consider random fully connected ReLU networks of width $n$ and depth $L$ equipped with a mean-field weight initialization. Our purpose is to study the dependence on $n$ and $L$ of the maximal update ($μ$P) learning rate, the largest learning rate for which the mean squared change in pre-activations after one step of gradient descent remains uniformly bounded at large $n,L$. As in prior work on $μ$P of Yang et al., we find that this maximal update learning rate is independent of $n$ for all but the first and last layer weights. However, we find that it has a non-trivial dependence on $L$, scaling like $L^{-3/2}$.
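
To make the $L^{-3/2}$ scaling concrete, here is a minimal sketch of how a learning rate tuned at one depth might be rescaled to another depth using the exponent stated in the abstract. The function name `depth_scaled_lr` and the parameters `base_lr` and `ref_depth` are hypothetical and do not come from the paper. The sketch also ignores the separate width dependence of the first and last layer weights.

```python
def depth_scaled_lr(base_lr: float, depth: int, ref_depth: int = 1) -> float:
    """Rescale a learning rate tuned at ref_depth to a new depth.

    Assumption: the learning rate follows the L^{-3/2} depth scaling
    quoted in the abstract; base_lr and ref_depth are illustrative names.
    """
    if depth < 1 or ref_depth < 1:
        raise ValueError("depth and ref_depth must be positive integers")
    # The ratio (depth / ref_depth) raised to -3/2 gives the rescaling factor.
    return base_lr * (depth / ref_depth) ** -1.5


if __name__ == "__main__":
    # Illustrative values only: a rate of 0.1 assumed to be tuned at depth 4.
    for depth in (4, 16, 64):
        print(depth, depth_scaled_lr(0.1, depth, ref_depth=4))
```

Under this rule, quadrupling the depth divides the learning rate by $4^{3/2} = 8$, so the rate shrinks faster than the depth grows.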

