LG | ML | Jul 13, 2021

How many degrees of freedom do we need to train deep networks: a loss landscape perspective

Brett W. Larsen, Stanislav Fort, Nic Becker, Surya Ganguli

arXiv:2107.05802v2 | 16.8 | 34 citations | Has Code

Originality: Incremental advance

AI Analysis

The paper offers deep learning practitioners theoretical insight into training efficiency, though the advance is incremental: it builds on existing work on pruning and random subspaces.

The paper analyzes how deep neural networks can be trained with far fewer degrees of freedom than their total parameter count. It finds a sharp phase transition in the probability of successful training as the training dimension crosses a threshold, explains this transition through the high-dimensional geometry of the loss landscape, and reports experiments in which the threshold dimension is a small fraction of the total parameters.

A variety of recent works, spanning pruning, lottery tickets, and training within random subspaces, have shown that deep neural networks can be trained using far fewer degrees of freedom than the total number of parameters. We analyze this phenomenon for random subspaces by first examining the success probability of hitting a training loss sub-level set when training within a random subspace of a given training dimensionality. We find a sharp phase transition in the success probability from $0$ to $1$ as the training dimension surpasses a threshold. This threshold training dimension increases as the desired final loss decreases, but decreases as the initial loss decreases. We then theoretically explain the origin of this phase transition, and its dependence on initialization and final desired loss, in terms of properties of the high-dimensional geometry of the loss landscape. In particular, we show, via Gordon's escape theorem, that the training dimension plus the Gaussian width of the desired loss sub-level set, projected onto a unit sphere surrounding the initialization, must exceed the total number of parameters for the success probability to be large. In several architectures and datasets, we measure the threshold training dimension as a function of initialization and demonstrate that it is a small fraction of the total parameters, implying by our theory that successful training with so few dimensions is possible precisely because the Gaussian width of low loss sub-level sets is very large. Moreover, we compare this threshold training dimension to more sophisticated ways of reducing training degrees of freedom, including lottery tickets as well as a new, analogous method: lottery subspaces. Code is available at https://github.com/ganguli-lab/degrees-of-freedom.
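The phase transition described in the abstract can be illustrated on a toy problem. The sketch below is not the authors' code and does not use a deep network. It assumes an underdetermined linear regression with D parameters and N data points, restricts the parameters to theta = theta0 + A z for a random D x d matrix A, and uses a least-squares solve as a stand-in for gradient-based training. The names `random_subspace_fit` and `success_probability` are hypothetical. In this special case the zero-loss set is an affine subspace of dimension D - N, so simple dimension counting suggests that a random d-dimensional subspace should reach it once d + (D - N) >= D, that is, once d >= N.

```python
import numpy as np


def random_subspace_fit(X, y, theta0, d, rng):
    """Best loss reachable on the random affine subspace theta0 + A z.

    A is a random D x d matrix with unit-norm columns. A least-squares
    solve over z stands in for training (an assumption of this sketch).
    """
    D = theta0.shape[0]
    A = rng.standard_normal((D, d))
    A /= np.linalg.norm(A, axis=0, keepdims=True)
    # The loss ||X (theta0 + A z) - y||^2 is a least-squares problem in z.
    residual0 = y - X @ theta0
    z, *_ = np.linalg.lstsq(X @ A, residual0, rcond=None)
    theta = theta0 + A @ z
    return float(np.mean((X @ theta - y) ** 2))


def success_probability(d, n_trials=20, D=200, N=30, target_loss=1e-8, seed=0):
    """Fraction of random trials whose final loss is at most target_loss."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        X = rng.standard_normal((N, D))   # toy inputs
        y = rng.standard_normal(N)        # toy targets
        theta0 = rng.standard_normal(D)   # random initialization
        hits += random_subspace_fit(X, y, theta0, d, rng) <= target_loss
    return hits / n_trials


if __name__ == "__main__":
    for d in (10, 25, 29, 30, 35, 60):
        print(f"d={d:3d}  success probability={success_probability(d):.2f}")
```

With the default settings, the estimated success probability should switch from 0 to 1 as d passes N = 30, far below D = 200. The transition is perfectly sharp here only because the toy loss is linear in the parameters; for the deep networks studied in the paper, the threshold depends on the initialization and on the target loss, as the abstract describes.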
