On the Algorithmic Stability and Generalization of Adaptive Optimization Methods
This paper addresses a foundational gap in machine learning theory that matters to researchers and practitioners who use adaptive optimization methods, though the new theoretical insights it provides are incremental.
The paper studies the theoretical properties of adaptive optimizers such as Adagrad and Adam by developing a novel framework for analyzing their stability and generalization. It derives provable guarantees that depend on a single parameter β₂ and supports them with empirical experiments.
Despite their popularity in deep learning and machine learning in general, the theoretical properties of adaptive optimizers such as Adagrad, RMSProp, Adam or AdamW are not yet fully understood. In this paper, we develop a novel framework to study the stability and generalization of these optimization methods. Based on this framework, we show provable guarantees about such properties that depend heavily on a single parameter $\beta_2$. Our empirical experiments support our claims and provide practical insights into the stability and generalization properties of adaptive optimization methods.
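
The abstract does not define the parameter, so the following is only a minimal sketch under the assumption that it is the second-moment decay rate of the standard Adam update. The function name `adam_step` and its default values are illustrative choices and are not taken from the paper. The sketch shows where this parameter enters a single update step.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of the standard Adam update; t is the 1-indexed step count.

    Assumption: beta2 here is the decay rate of the second-moment estimate,
    which is how the parameter is conventionally named in Adam.
    """
    # Exponential moving average of the gradient (first moment).
    m = beta1 * m + (1.0 - beta1) * grad
    # Exponential moving average of the squared gradient (second moment).
    # beta2 controls how quickly old squared gradients are forgotten.
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias correction for the zero initialization of m and v.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Per-coordinate step scaled by the root of the second-moment estimate.
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Usage: a few steps on f(x) = x^2 (gradient 2x) with two values of beta2.
for beta2 in (0.9, 0.999):
    x, m, v = 1.0, 0.0, 0.0
    for t in range(1, 6):
        x, m, v = adam_step(x, 2.0 * x, m, v, t, beta2=beta2)
    print(f"beta2={beta2}: x after 5 steps = {x:.6f}")
```

In this update, a smaller value of the parameter makes the second-moment estimate follow recent squared gradients more closely, while a value near 1 averages over a longer history.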