AdaGrad under Anisotropic Smoothness
This work addresses a theoretical problem for researchers and practitioners in machine learning, providing incremental insights into adaptive optimization methods.
The paper tackles the theoretical gap in understanding the advantages of adaptive gradient methods like AdaGrad over uniform step-size methods in large batch settings, showing that under anisotropic smoothness and noise conditions, AdaGrad achieves faster convergence with better dimensional dependence, supported by experiments in logistic regression and instruction-following fine-tuning tasks.
Adaptive gradient methods have been widely adopted in training large-scale deep neural networks, especially large foundation models. Despite the huge success in practice, their theoretical advantages over classical gradient methods with uniform step sizes across all coordinates (e.g. SGD) have not been fully understood, especially in the large batch-size setting commonly used in practice. This is because the only theoretical result that can demonstrate this benefit was obtained in the original paper of Adagrad for convex nonsmooth objective functions, which is insufficient for large batch algorithms. In this work, we attempt to resolve this gap between theory and practice by proposing a novel anisotropic generalized smoothness assumption and providing corresponding analyses of Adagrad. It is shown that under anisotropic smoothness and noise conditions, AdaGrad can achieve faster convergence guarantees in terms of better dimensional dependence than algorithms with uniform step sizes across all coordinates. Experiments in logistic regression and instruction following fine-tuning tasks provide strong evidence to support our novel assumption and theoretical analysis.