LG OC ML

May 29, 2025

The Rich and the Simple: On the Implicit Bias of Adam and SGD

Bhavya Vasudeva, Jung Whan Lee, Vatsal Sharan, Mahdi Soltanolkotabi

arXiv:2505.24022v2

19.78 citations

h-index: 16

Originality: Highly original

AI Analysis

This paper addresses a question relevant to machine learning practitioners: how the implicit biases of optimization algorithms differ. It reveals Adam's advantage in handling spurious correlations and distribution shifts, though it is incremental relative to prior work on simplicity bias.

The paper investigates the implicit bias of Adam versus SGD in training neural networks, showing that while SGD exhibits simplicity bias leading to suboptimal linear decision boundaries, Adam produces richer nonlinear boundaries closer to Bayes' optimal predictor, achieving higher test accuracy both in-distribution and under distribution shifts.

Adam is the de facto optimization algorithm for several deep learning applications, but an understanding of its implicit bias and how it differs from other algorithms, particularly standard first-order methods such as (stochastic) gradient descent (GD), remains limited. In practice, neural networks (NNs) trained with SGD are known to exhibit simplicity bias -- a tendency to find simple solutions. In contrast, we show that Adam is more resistant to such simplicity bias. First, we investigate the differences in the implicit biases of Adam and GD when training two-layer ReLU NNs on a binary classification task with Gaussian data. We find that GD exhibits a simplicity bias, resulting in a linear decision boundary with a suboptimal margin, whereas Adam leads to much richer and more diverse features, producing a nonlinear boundary that is closer to the Bayes' optimal predictor. This richer decision boundary also allows Adam to achieve higher test accuracy both in-distribution and under certain distribution shifts. We theoretically prove these results by analyzing the population gradients. Next, to corroborate our theoretical findings, we present extensive empirical results showing that this property of Adam leads to superior generalization across various datasets with spurious correlations where NNs trained with SGD are known to show simplicity bias and do not generalize well under certain distributional shifts.
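The comparison described in the abstract can be tried on a small scale. The following is a minimal NumPy sketch, not the authors' code: the toy data (one Gaussian for class +1, a two-Gaussian mixture for class -1), the fixed second layer, the width, the learning rates, and the helper names `make_data`, `loss_and_grad`, `train`, and `accuracy` are all assumptions made for illustration. It trains the first layer of a two-layer ReLU network with full-batch GD and with the standard Adam update on the logistic loss.

```python
import numpy as np

def make_data(n, rng, mu=2.0, sigma=1.0):
    """Hypothetical toy data: class +1 is one Gaussian, class -1 a mixture of two."""
    y = rng.choice([-1.0, 1.0], size=n)
    side = rng.choice([-1.0, 1.0], size=n)  # which cluster a class -1 point uses
    neg_means = np.stack([-mu * np.ones(n), side * mu], axis=1)
    means = np.where(y[:, None] > 0, np.array([mu, 0.0]), neg_means)
    X = means + sigma * rng.standard_normal((n, 2))
    return X, y

def forward(W, a, X):
    """Two-layer ReLU network f(x) = sum_j a_j * relu(w_j . x)."""
    return np.maximum(X @ W.T, 0.0) @ a

def loss_and_grad(W, a, X, y):
    """Mean logistic loss and its gradient with respect to the first layer W."""
    z = X @ W.T
    margins = y * (np.maximum(z, 0.0) @ a)
    loss = np.mean(np.logaddexp(0.0, -margins))
    # d(loss)/d(f_i) = -y_i * sigmoid(-margin_i) / n, written in a stable form
    g = -y * 0.5 * (1.0 - np.tanh(margins / 2.0)) / len(y)
    G = g[:, None] * (z > 0) * a[None, :]
    return loss, G.T @ X

def train(X, y, optimizer, lr, steps=500, width=32, seed=0,
          b1=0.9, b2=0.999, eps=1e-8):
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((width, X.shape[1]))
    # Assumption: second layer is fixed at +/- 1/sqrt(width); only W is trained.
    a = np.where(np.arange(width) % 2 == 0, 1.0, -1.0) / np.sqrt(width)
    mom, v = np.zeros_like(W), np.zeros_like(W)
    loss = np.inf
    for t in range(1, steps + 1):
        loss, grad = loss_and_grad(W, a, X, y)
        if optimizer == "gd":
            W -= lr * grad
        elif optimizer == "adam":
            mom = b1 * mom + (1 - b1) * grad
            v = b2 * v + (1 - b2) * grad ** 2
            m_hat = mom / (1 - b1 ** t)  # bias-corrected first moment
            v_hat = v / (1 - b2 ** t)    # bias-corrected second moment
            W -= lr * m_hat / (np.sqrt(v_hat) + eps)
        else:
            raise ValueError(f"unknown optimizer: {optimizer}")
    return W, a, loss

def accuracy(W, a, X, y):
    return float(np.mean(np.sign(forward(W, a, X)) == y))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_train, y_train = make_data(2000, rng)
    X_test, y_test = make_data(2000, rng)
    for name, lr in (("gd", 0.1), ("adam", 0.01)):
        W, a, loss = train(X_train, y_train, name, lr=lr)
        print(name, "train loss:", loss, "test acc:", accuracy(W, a, X_test, y_test))
```

To look for the linear-versus-nonlinear difference the abstract reports, one could evaluate `forward` on a 2D grid and plot the sign of the output for each optimizer. Whether this particular toy setup reproduces the effect is not guaranteed, since the paper's exact data distribution and training details are not given on this page.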
