The Effect of Mini-Batch Noise on the Implicit Bias of Adam

arXiv:2602.01642

Originality: Incremental advance

AI Analysis

This work studies how to choose Adam's hyperparameters for better generalization in multi-epoch training, a question relevant to deep learning practitioners. The advance is incremental, since it builds on the existing understanding of implicit bias.

The paper investigates how mini-batch noise affects the implicit bias of the Adam optimizer towards sharper or flatter regions of the loss landscape. It finds that the best momentum hyperparameters (β1, β2) depend on batch size: (0.9, 0.999) is a good choice for small batches, while for larger batches moving β1 closer to β2 often improves validation accuracy.

With limited high-quality data and growing compute, multi-epoch training is regaining importance across sub-areas of deep learning. Adam(W), versions of which are go-to optimizers for many tasks such as next token prediction, has two momentum hyperparameters $(β_1, β_2)$ controlling memory and one very important hyperparameter, batch size, controlling (in particular) the amount of mini-batch noise. We introduce a theoretical framework to understand how mini-batch noise influences the implicit bias of memory in Adam (depending on $β_1$, $β_2$) towards sharper or flatter regions of the loss landscape, which is commonly observed to correlate with the generalization gap in multi-epoch training. We find that in the case of large batch sizes, higher $β_2$ increases the magnitude of anti-regularization by memory (hurting generalization), but as the batch size becomes smaller, the dependence of (anti-)regularization on $β_2$ is reversed. A similar monotonicity shift (in the opposite direction) happens in $β_1$. In particular, the common "default" pair $(β_1, β_2) = (0.9, 0.999)$ is a good choice if batches are small; for larger batches, in many settings moving $β_1$ closer to $β_2$ is much better in terms of validation accuracy in multi-epoch training. Moreover, our theoretical derivations connect the scale of the batch size at which the shift happens to the scale of the critical batch size. We illustrate this effect in experiments with small-scale data in the about-to-overfit regime.
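
To make the roles of $β_1$, $β_2$, and batch size concrete, here is a minimal sketch of the standard Adam update (with bias correction) in plain Python. The toy 1D regression problem, the helper names `adam_step`, `minibatch_grad`, and `train`, and all numeric settings are assumptions made for illustration; they are not taken from the paper and do not reproduce its experiments or its theoretical framework.

```python
import math
import random


def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update with bias correction on lists of floats.

    beta1 sets the memory of the first-moment average m,
    beta2 sets the memory of the second-moment average v.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g          # first-moment EMA
        v_i = beta2 * v_i + (1.0 - beta2) * g * g      # second-moment EMA
        m_hat = m_i / (1.0 - beta1 ** t)               # bias correction
        v_hat = v_i / (1.0 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


def minibatch_grad(w, data, batch_size, rng):
    """Gradient of the mean of 0.5 * (w * x - y)^2 over a random mini-batch.

    Smaller batch_size means a noisier gradient estimate.
    """
    batch = rng.sample(data, batch_size)
    return sum((w * x - y) * x for x, y in batch) / batch_size


def train(beta1, beta2, batch_size, steps=2000, lr=1e-2, seed=0):
    """Fit y = w * x on hypothetical synthetic data (true slope 3.0)."""
    rng = random.Random(seed)
    data = []
    for _ in range(256):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 3.0 * x + rng.gauss(0.0, 0.1)))
    w, m, v = [0.0], [0.0], [0.0]
    for t in range(1, steps + 1):
        g = minibatch_grad(w[0], data, batch_size, rng)
        w, m, v = adam_step(w, [g], m, v, t, lr=lr, beta1=beta1, beta2=beta2)
    return w[0]


if __name__ == "__main__":
    # Compare the default pair with a pair where beta1 is moved closer to beta2,
    # at a small and at a full batch size.
    for batch_size in (8, 256):
        for betas in ((0.9, 0.999), (0.99, 0.999)):
            print(batch_size, betas, train(betas[0], betas[1], batch_size))
```

On a convex toy problem like this one, every configuration should land near the same minimizer, so the script only shows where the three hyperparameters enter the update. The sharpness and generalization effects described in the abstract concern non-convex, multi-epoch training and are not visible here.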
