LG ML | Sep 8, 2022

Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes

Maxim Kodryan, Ekaterina Lobacheva, Maksim Nakhodnov, Dmitry Vetrov

arXiv:2209.03695v3 | 13.620 citations | h-index: 32 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses optimization challenges in deep learning, offering researchers incremental insights into the loss landscape dynamics of normalized networks.

The paper investigates training scale-invariant neural networks directly on a sphere with a fixed effective learning rate, identifying three regimes—convergence, chaotic equilibrium, and divergence—and analyzes their properties through theoretical and empirical methods. It shows how these regimes relate to conventional normalized network training and can be used to find better optima.

A fundamental property of deep learning normalization techniques, such as batch normalization, is that they make the pre-normalization parameters scale-invariant. The intrinsic domain of such parameters is the unit sphere, and therefore their gradient optimization dynamics can be represented via spherical optimization with a varying effective learning rate (ELR), which was studied previously. However, the varying ELR may obscure certain characteristics of the intrinsic loss landscape structure. In this work, we investigate the properties of training scale-invariant neural networks directly on the sphere using a fixed ELR. We discover three regimes of such training depending on the ELR value: convergence, chaotic equilibrium, and divergence. We study these regimes in detail through both a theoretical examination of a toy example and a thorough empirical analysis of real scale-invariant deep learning models. Each regime has unique features and reflects specific properties of the intrinsic loss landscape, some of which have strong parallels with previous research on training both regular and scale-invariant neural networks. Finally, we demonstrate how the discovered regimes are reflected in conventional training of normalized networks and how they can be leveraged to achieve better optima.
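The idea of optimizing on the sphere with a fixed ELR can be illustrated with a minimal sketch. It assumes a toy scale-invariant loss f(w) = 0.5 u^T A u with u = w / ||w|| and a symmetric matrix A; this loss is an illustration only and is not the toy example analyzed in the paper. The names `loss`, `grad`, `sphere_gd`, and `A` are hypothetical, and the update (a gradient step followed by projection back onto the unit sphere) is one common way to realize fixed-ELR spherical training, not necessarily the authors' exact procedure.

```python
import numpy as np

def loss(w, A):
    # Scale-invariant by construction: depends only on the direction w / ||w||.
    u = w / np.linalg.norm(w)
    return 0.5 * u @ A @ u

def grad(w, A):
    # Gradient of loss() with respect to w (A assumed symmetric).
    # It is orthogonal to w and scales as 1 / ||w||.
    norm = np.linalg.norm(w)
    u = w / norm
    Au = A @ u
    return (Au - (u @ Au) * u) / norm

def sphere_gd(w0, A, elr, steps):
    # Gradient descent restricted to the unit sphere: since ||w|| = 1 at
    # every step, the step size `elr` acts as a fixed effective learning rate.
    w = w0 / np.linalg.norm(w0)
    losses = []
    for _ in range(steps):
        losses.append(loss(w, A))
        w = w - elr * grad(w, A)      # gradient step in the tangent direction
        w = w / np.linalg.norm(w)     # project back onto the unit sphere
    return w, np.array(losses)

rng = np.random.default_rng(0)
A = np.diag([1.0, 2.0, 5.0])          # illustrative choice, not from the paper
w0 = rng.normal(size=3)
w_final, losses = sphere_gd(w0, A, elr=0.1, steps=500)
```

In this sketch a small `elr` drives the iterate toward the direction that minimizes the toy loss. The convex-on-the-sphere toy problem is not meant to reproduce the chaotic equilibrium or divergence regimes described in the abstract; it only shows the update rule and the two properties (scale invariance and gradient orthogonality) that motivate working on the sphere.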
