LG · ML · Feb 7, 2019

Combining learning rate decay and weight decay with complexity gradient descent - Part I

arXiv:1902.02881v1 · 4 citations
Originality: Incremental advance
AI Analysis

This paper addresses optimization challenges in deep learning, offering incremental improvements in training efficiency.

The paper tackles the unclear role of L2 regularization in deep neural networks by introducing a 'complexity' metric based on loss level and nonconvexity, leading to novel annealing schemes for regularization strength during training.

The role of $L^2$ regularization, in the specific case of deep neural networks rather than more traditional machine learning models, is still not fully elucidated. We hypothesize that this complex interplay is due to the combination of overparameterization and high dimensional phenomena that take place during training and make it unamenable to standard convex optimization methods. Using insights from statistical physics and random fields theory, we introduce a parameter factoring in both the level of the loss function and its remaining nonconvexity: the \emph{complexity}. We proceed to show that it is desirable to proceed with \emph{complexity gradient descent}. We then show how to use this intuition to derive novel and efficient annealing schemes for the strength of $L^2$ regularization when performing standard stochastic gradient descent in deep neural networks.
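As a rough illustration of what annealing the strength of $L^2$ regularization during gradient descent can look like, the sketch below runs plain gradient descent on a toy quadratic loss with a time-varying penalty coefficient. The abstract does not give the complexity-based schedule, so the exponential decay used here is a stand-in. The names `l2_schedule`, `lr_schedule` and `train`, and all constants, are assumptions made for illustration and are not taken from the paper.

```python
def l2_schedule(step, lam0=0.1, decay=0.9):
    """Hypothetical exponential annealing of the L2 strength.

    This is a placeholder schedule, NOT the complexity-based scheme
    derived in the paper.
    """
    return lam0 * decay ** step


def lr_schedule(step, lr0=0.5, k=0.01):
    """Hypothetical inverse-time learning rate decay."""
    return lr0 / (1.0 + k * step)


def train(target, steps=200):
    """Gradient descent on 0.5*||w - target||^2 + 0.5*lam_t*||w||^2.

    Both the learning rate and the L2 coefficient lam_t change per step.
    """
    w = [0.0] * len(target)
    for t in range(steps):
        lam = l2_schedule(t)
        lr = lr_schedule(t)
        # Data-term gradient (w - target) plus L2-term gradient lam * w.
        grad = [(wi - ti) + lam * wi for wi, ti in zip(w, target)]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


if __name__ == "__main__":
    print(train([1.0, -2.0]))
```

Because the penalty coefficient shrinks toward zero in this toy schedule, the iterates end up close to the unregularized minimizer `target`. A fixed coefficient would instead leave them shrunk toward the origin.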

