LG | AI | Jun 1, 2023

On the Weight Dynamics of Deep Normalized Networks

arXiv:2306.00700v3 | 4 citations | h-index: 32
Originality: Incremental advance
AI Analysis

This work addresses trainability issues in deep normalized networks, particularly very deep ones, but it is incremental: it builds on existing studies of learning rate disparities.

The paper tackles high disparities in effective learning rates (ELRs) across layers of deep neural networks, which can harm trainability. It models the weight dynamics of networks with normalization layers and proves that, when training with any constant learning rate, layer-wise ELR ratios converge to 1. The authors validate this with a hyper-parameter-free warm-up method that quickly minimizes ELR spread.

Recent studies have shown that high disparities in effective learning rates (ELRs) across layers in deep neural networks can negatively affect trainability. We formalize how these disparities evolve over time by modeling weight dynamics (evolution of expected gradient and weight norms) of networks with normalization layers, predicting the evolution of layer-wise ELR ratios. We prove that when training with any constant learning rate, ELR ratios converge to 1, despite initial gradient explosion. We identify a "critical learning rate" beyond which ELR disparities widen, which only depends on current ELRs. To validate our findings, we devise a hyper-parameter-free warm-up method that successfully minimizes ELR spread quickly in theory and practice. Our experiments link ELR spread with trainability, a relationship that is most evident in very deep networks with significant gradient magnitude excursions.
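
The convergence claim in the abstract can be illustrated with a toy simulation. This is a minimal sketch under strong assumptions and is not the paper's model: it uses the common definition ELR = lr / ||w||^2 (the paper's exact definition may differ), plain SGD without weight decay or momentum, and two scale-invariant layers whose gradient norms equal the same constant divided by ||w||. The function name and parameters (`simulate_elr_ratio`, `grad_scale`) are hypothetical.

```python
def simulate_elr_ratio(norm_sq_a, norm_sq_b, lr=1.0, grad_scale=1.0, steps=20000):
    """Track ELR_a / ELR_b for two scale-invariant layers in a toy model.

    Assumptions (illustrative, not taken from the paper):
      * ELR = lr / ||w||^2, so ELR_a / ELR_b = ||w_b||^2 / ||w_a||^2.
      * The gradient is orthogonal to the weights (true for any
        scale-invariant loss), so each SGD step adds lr^2 * ||g||^2
        to the squared weight norm.
      * ||g|| = grad_scale / ||w||, with the same grad_scale in both layers.
    """
    ratios = []
    for _ in range(steps):
        ratios.append(norm_sq_b / norm_sq_a)
        # ||w_new||^2 = ||w||^2 + lr^2 * ||g||^2 = ||w||^2 + (lr * grad_scale)^2 / ||w||^2
        norm_sq_a += (lr * grad_scale) ** 2 / norm_sq_a
        norm_sq_b += (lr * grad_scale) ** 2 / norm_sq_b
    ratios.append(norm_sq_b / norm_sq_a)
    return ratios


if __name__ == "__main__":
    # Layer b starts with a 16x larger squared norm, i.e. a 16x smaller ELR.
    ratios = simulate_elr_ratio(1.0, 16.0)
    print(f"initial ELR ratio: {ratios[0]:.3f}")
    print(f"final ELR ratio:   {ratios[-1]:.3f}")
```

In this toy setting the squared norms of both layers grow at nearly the same rate, so the initial gap becomes negligible relative to their size and the ratio drifts toward 1. The sketch says nothing about the critical learning rate or the warm-up method, which depend on details not given in the abstract.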

