LG NE ML
Jun 8, 2020

The Golden Ratio of Learning and Momentum

arXiv:2006.04751v1
6 citations

Originality: Incremental advance

AI Analysis

This work addresses a fundamental optimization problem in neural network training, offering a theoretical basis for parameters often set empirically, though it appears incremental as it builds on existing backpropagation methods.

The paper tackles the empirical selection of learning rate and momentum in backpropagation. It proposes a new information-theoretical loss function derived from neural signal processing, which implies specific values for both parameters, and it reports handwritten digit recognition experiments that support their practical utility.

Gradient descent has been a central training principle for artificial neural networks from their earliest days to today's deep learning networks. The most common implementation is the backpropagation algorithm for training feed-forward neural networks in a supervised fashion. Backpropagation involves computing the gradient of a loss function with respect to the weights of the network in order to update the weights and thus minimize the loss. Although the mean square error is often used as a loss function, the general stochastic gradient descent principle does not immediately connect with a specific loss function. Another drawback of backpropagation has been the search for optimal values of two important training parameters, learning rate and momentum weight, which are determined empirically in most systems. The learning rate specifies the step size towards a minimum of the loss function when following the gradient, while the momentum weight takes previous weight changes into account when updating the current weights. Using both parameters in conjunction is generally accepted as a means of improving training, although their specific values do not follow immediately from standard backpropagation theory. This paper proposes a new information-theoretical loss function motivated by neural signal processing in a synapse. The new loss function implies a specific learning rate and momentum weight, leading to the empirical parameter values often used in practice. The proposed framework also provides a more formal explanation of the momentum term and its smoothing effect on the training process. Taken together, the results show that loss, learning rate, and momentum are closely connected. To support these theoretical findings, experiments on handwritten digit recognition show the practical usefulness of the proposed loss function and training parameters.
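
To make the roles of the two parameters concrete, here is a minimal sketch of the classical gradient descent update with momentum that the abstract describes. It is not the paper's proposed loss function or its derived parameter values: the function names, the toy quadratic loss, and the settings `learning_rate=0.1` and `momentum=0.9` are illustrative assumptions only.

```python
def sgd_momentum_step(weights, grads, velocity, learning_rate, momentum):
    """One classical momentum update (illustrative, not the paper's method).

    The new weight change is the previous change scaled by `momentum`,
    minus the gradient scaled by `learning_rate`.
    """
    new_velocity = [momentum * v - learning_rate * g
                    for v, g in zip(velocity, grads)]
    new_weights = [w + dv for w, dv in zip(weights, new_velocity)]
    return new_weights, new_velocity


def minimize_quadratic(steps=200, learning_rate=0.1, momentum=0.9):
    """Minimize the toy loss 0.5 * sum(w_i ** 2), whose gradient is w itself.

    The default parameter values are placeholders chosen for this demo.
    """
    weights = [1.0, -2.0]
    velocity = [0.0, 0.0]
    for _ in range(steps):
        grads = list(weights)  # gradient of 0.5 * w^2 is w
        weights, velocity = sgd_momentum_step(
            weights, grads, velocity, learning_rate, momentum
        )
    return weights


if __name__ == "__main__":
    print(minimize_quadratic())
```

In this sketch the momentum term reuses the previous weight change, so successive updates are smoothed rather than following each new gradient on its own. With `momentum=0.0` the update reduces to plain gradient descent.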
