LG, Feb 29, 2024

Batch size invariant Adam

arXiv:2402.18824v1, 4 citations, h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses the challenge of stable distributed training, a practical concern for machine learning practitioners. The advance is incremental, since it builds on the existing Adam optimizer.

The authors address the Adam optimizer's sensitivity to batch size in distributed training by proposing a batch size invariant version that squares the micro-batch gradients before averaging them, which removes the strong assumptions required by prior methods. They confirm that, in practice, this approach achieves batch size invariance in a wider range of scenarios than the previous approach.
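
The order of squaring and averaging matters for a simple reason, which a short calculation can illustrate. This is a sketch under assumptions that are not taken from the paper: the symbols are introduced here for illustration, with $g_1, \dots, g_B$ denoting the micro-batch gradients of a single parameter, assumed independent and identically distributed with mean $\mu$ and variance $\sigma^2$.

```latex
\mathbb{E}\!\left[\Big(\tfrac{1}{B}\sum_{i=1}^{B} g_i\Big)^{2}\right]
  = \mu^{2} + \frac{\sigma^{2}}{B},
\qquad
\mathbb{E}\!\left[\tfrac{1}{B}\sum_{i=1}^{B} g_i^{2}\right]
  = \mu^{2} + \sigma^{2}.
```

Under these assumptions, the average-then-square quantity depends on the number of micro-batches $B$, while the square-then-average quantity does not. The first expression scales as $1/B$ only when $\sigma^{2}/B \gg \mu^{2}$, which is consistent with the variance-dominance assumption attributed to the square-root learning rate scaling approach.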

We propose a batch size invariant version of Adam, for use in large-scale, distributed settings, in which the mini-batch is divided into micro-batches which are distributed among worker nodes. For the v term, standard Adam first computes the average over micro-batch gradients, then squares, while in the batch size invariant Adam proposed here, we first square the micro-batch gradients, then average. Previous work (e.g. Malladi et al. 2022) used an alternative approach that involved a square-root scaling of the learning rate, but this approach requires strong assumptions to work; in particular that the gradient variance dominates the square of the expected gradient. In contrast, the approach proposed here gives batch size invariance without this assumption. We confirm that in practice our scheme gives batch size invariance in a much larger range of scenarios than the previous approach.
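
The difference between the two orderings can be sketched numerically. The following is a minimal illustration, not the authors' implementation: the function names `second_moment_targets` and `expected_targets` are hypothetical, and the micro-batch gradients are simulated as Gaussian noise around a small mean, which is an assumption made purely for the demonstration. It compares the quantity that would feed the v term under each ordering as the number of micro-batches grows.

```python
import random


def second_moment_targets(micro_grads):
    """Return the two candidate inputs to the v update for one parameter.

    micro_grads: list of per-micro-batch gradients (floats) for one step.
    """
    n = len(micro_grads)
    mean_grad = sum(micro_grads) / n
    standard = mean_grad ** 2                          # average, then square
    invariant = sum(g * g for g in micro_grads) / n    # square, then average
    return standard, invariant


def expected_targets(num_micro, mu=0.1, sigma=1.0, trials=20000, seed=0):
    """Monte Carlo estimate of both quantities.

    Assumption (illustrative only): micro-batch gradients are i.i.d.
    Gaussian with mean mu and standard deviation sigma.
    """
    rng = random.Random(seed)
    std_total = 0.0
    inv_total = 0.0
    for _ in range(trials):
        grads = [rng.gauss(mu, sigma) for _ in range(num_micro)]
        s, i = second_moment_targets(grads)
        std_total += s
        inv_total += i
    return std_total / trials, inv_total / trials


if __name__ == "__main__":
    for num_micro in (1, 4, 16):
        std, inv = expected_targets(num_micro)
        print(f"micro-batches={num_micro:2d}  "
              f"average-then-square={std:.3f}  square-then-average={inv:.3f}")
```

In this toy setting the average-then-square estimate shrinks as more micro-batches are averaged, while the square-then-average estimate stays roughly constant, which is the sense in which the second ordering does not depend on the batch size.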

