LG · AI · ML · Oct 13, 2025

Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks

arXiv:2510.11354v1 · h-index: 3
Originality: Incremental advance
AI Analysis

This work studies how stochastic (mini-batch) Adam generalizes and offers theoretical guidance on tuning batch size and weight decay. It is an incremental advance that extends existing analyses of Adam.

The paper addresses the gap between theoretical analyses of full-batch Adam and the stochastic variant used in practice. It shows that mini-batch Adam can achieve near-zero test error in two-layer over-parameterized CNNs, whereas full-batch Adam converges to solutions with poor test error.
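One common intuition for why mini-batch and full-batch Adam can behave differently is that, in the idealized limit of no momentum and negligible epsilon, an Adam step reduces to a sign step. The sketch below uses that simplification with made-up per-example gradients; it is an illustration under those assumptions, not the paper's construction or proof.

```python
# Toy illustration (assumption: beta1 = beta2 = 0 and epsilon -> 0, so each
# Adam step is lr * sign(gradient)). The gradient values are hypothetical.

def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

# Hypothetical per-example gradients for a single scalar weight.
per_example_grads = [5.0, -1.0, -1.0, -1.0]
n = len(per_example_grads)

# Full-batch: take the sign of the averaged gradient.
full_batch_grad = sum(per_example_grads) / n
full_batch_direction = sign(full_batch_grad)

# Mini-batch with batch size 1: each step uses the sign of one example's
# gradient, so the average direction over an epoch is the mean of the signs.
mini_batch_direction = sum(sign(g) for g in per_example_grads) / n

print("full-batch direction:", full_batch_direction)
print("mean mini-batch direction:", mini_batch_direction)
```

In this toy case the two directions disagree in sign, and shrinking the learning rate does not change that, since the sign operation is applied before averaging across batches.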

Adam is a popular and widely used adaptive gradient method in deep learning, which has also received tremendous focus in theoretical research. However, most existing theoretical work primarily analyzes its full-batch version, which differs fundamentally from the stochastic variant used in practice. Unlike SGD, stochastic Adam does not converge to its full-batch counterpart even with infinitesimal learning rates. We present the first theoretical characterization of how batch size affects Adam's generalization, analyzing two-layer over-parameterized CNNs on image data. Our results reveal that while both Adam and AdamW with proper weight decay $λ$ converge to poor test error solutions, their mini-batch variants can achieve near-zero test error. We further prove Adam has a strictly smaller effective weight decay bound than AdamW, theoretically explaining why Adam requires more sensitive $λ$ tuning. Extensive experiments validate our findings, demonstrating the critical role of batch size and weight decay in Adam's generalization performance.
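The distinction between Adam with weight decay and AdamW can be made concrete with a single update step. The sketch below assumes the usual formulations: Adam adds $λ w$ to the gradient before the adaptive normalization (coupled L2 regularization), while AdamW shrinks the weight directly (decoupled decay). The function names and the scalar setting are hypothetical and chosen only for illustration; this is not the paper's analysis.

```python
import math

def adam_l2_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, lam=0.0):
    """One scalar Adam step with coupled weight decay (L2 added to the gradient)."""
    g = grad + lam * w                      # decay enters the gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)            # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, lam=0.0):
    """One scalar AdamW step with decoupled weight decay."""
    w = w * (1 - lr * lam)                  # decay bypasses the normalization
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# With a zero loss gradient, only the decay term moves the weight.
w_adam, _, _ = adam_l2_step(w=1.0, grad=0.0, m=0.0, v=0.0, t=1, lam=0.1)
w_adamw, _, _ = adamw_step(w=1.0, grad=0.0, m=0.0, v=0.0, t=1, lam=0.1)

print("Adam (coupled) :", w_adam)    # moves by about lr, since the decay term is normalized
print("AdamW (decoupled):", w_adamw) # moves by lr * lam * w
```

In the coupled form the decay term passes through the same normalization as the gradient, so its effective strength depends on the gradient scale rather than on $λ$ alone. This gives some feel for why the two optimizers can respond differently to the same $λ$.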

