LG, OC · Jun 27, 2022

Theoretical analysis of Adam using hyperparameters close to one without Lipschitz smoothness

arXiv:2206.13290v1 · 7.87 citations · h-index: 25

Originality: Incremental advance

AI Analysis

This work removes the Lipschitz smoothness condition, which is unrealistic in practice, from the theoretical analysis of Adam, and provides theoretical justification for the hyperparameter choices that deep learning practitioners commonly use.

The paper tackles the gap between theoretical analyses of Adam, which assume Lipschitz smoothness, and practical implementations, which use small constant learning rates and hyperparameters close to one. It shows that Adam performs well in this practical setting without any Lipschitz smoothness assumption. The analysis also indicates that Adam performs well with large batch sizes and with diminishing learning rates.

Convergence and convergence rate analyses of adaptive methods, such as Adaptive Moment Estimation (Adam) and its variants, have been widely studied for nonconvex optimization. The analyses are based on assumptions that the expected or empirical average loss function is Lipschitz smooth (i.e., its gradient is Lipschitz continuous) and the learning rates depend on the Lipschitz constant of the Lipschitz continuous gradient. Meanwhile, numerical evaluations of Adam and its variants have clarified that using small constant learning rates without depending on the Lipschitz constant and hyperparameters ($β_1$ and $β_2$) close to one is advantageous for training deep neural networks. Since computing the Lipschitz constant is NP-hard, the Lipschitz smoothness condition would be unrealistic. This paper provides theoretical analyses of Adam without assuming the Lipschitz smoothness condition in order to bridge the gap between theory and practice. The main contribution is to show theoretical evidence that Adam using small learning rates and hyperparameters close to one performs well, whereas the previous theoretical results were all for hyperparameters close to zero. Our analysis also leads to the finding that Adam performs well with large batch sizes. Moreover, we show that Adam performs well when it uses diminishing learning rates and hyperparameters close to one.
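To make the setting concrete, here is a minimal sketch of the standard Adam update (the textbook form with bias correction, which may differ in detail from the variant the paper analyzes) run with a small constant learning rate and $β_1$, $β_2$ close to one. The toy objective $f(x) = \tfrac{1}{2}\|x\|^2$, the Gaussian gradient noise, and the names `adam_minimize` and `make_minibatch_grad` are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def adam_minimize(grad_fn, x0, lr=1e-3, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=10_000):
    """Textbook Adam with bias correction; returns the final iterate.

    lr is a small constant that does not depend on any Lipschitz constant,
    and beta1, beta2 are close to one, as in common practice.
    """
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)  # first-moment (momentum) estimate
    v = np.zeros_like(x)  # second-moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x


def make_minibatch_grad(batch_size, noise_std=1.0, seed=0):
    """Hypothetical stochastic gradient of f(x) = 0.5 * ||x||^2.

    The true gradient is x; each call adds the mean of `batch_size`
    independent Gaussian noise samples, so a larger batch gives a
    lower-variance gradient estimate.
    """
    rng = np.random.default_rng(seed)

    def grad_fn(x):
        noise = rng.normal(0.0, noise_std, size=(batch_size,) + x.shape)
        return x + noise.mean(axis=0)

    return grad_fn


x_final = adam_minimize(make_minibatch_grad(batch_size=64),
                        x0=np.array([1.0, -1.0]))
print(x_final)  # expected to end up near the minimizer at the origin
```

Changing `batch_size`, `lr`, `beta1`, or `beta2` in this toy setup is one way to get a feel for the quantities the analysis is about; it is only an illustration and does not reproduce the paper's experiments or bounds.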
