LG | CL | ML | Feb 4, 2025

Avoiding spurious sharpness minimization broadens applicability of SAM

ETH Zurich
arXiv:2502.02407v1 | 8 citations | h-index: 11 | ICML
AI Analysis

This work addresses the limited applicability of curvature regularization techniques in NLP, offering incremental improvements for training large language models.

The paper tackles the poor performance of Sharpness Aware Minimization (SAM) on NLP tasks. It finds that in this setting SAM mostly regularizes logit statistics rather than the geometry of the function itself. In response, it develops Functional-SAM and a preconditioned SAM perturbation, which improve over AdamW and SAM baselines across model scales, including billion-parameter LLMs.

Curvature regularization techniques like Sharpness Aware Minimization (SAM) have shown great promise in improving generalization on vision tasks. However, we find that SAM performs poorly in domains like natural language processing (NLP), often degrading performance -- even with twice the compute budget. We investigate the discrepancy across domains and find that in the NLP setting, SAM is dominated by regularization of the logit statistics -- instead of improving the geometry of the function itself. We use this observation to develop an alternative algorithm we call Functional-SAM, which regularizes curvature only through modification of the statistics of the overall function implemented by the neural network, and avoids spurious minimization through logit manipulation. Furthermore, we argue that preconditioning the SAM perturbation also prevents spurious minimization, and when combined with Functional-SAM, it gives further improvements. Our proposed algorithms show improved performance over AdamW and SAM baselines when trained for an equal number of steps, in both fixed-length and Chinchilla-style training settings, at various model scales (including billion-parameter scale). On the whole, our work highlights the importance of more precise characterizations of sharpness in broadening the applicability of curvature regularization to large language models (LLMs).
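For orientation, the baseline that the abstract modifies can be sketched in a few lines. This is a minimal illustration of one standard SAM update on a toy quadratic loss, not the paper's Functional-SAM or its preconditioned variant, whose update rules the abstract does not state. The names `sam_step`, `toy_loss`, and `toy_grad`, and the values of `lr` and `rho`, are assumptions chosen for the example.

```python
import math

def sam_step(params, grad_fn, lr=0.1, rho=0.05):
    """One standard SAM update (illustrative; not Functional-SAM)."""
    # First gradient evaluation, at the current parameters.
    g = grad_fn(params)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        # Zero gradient: no ascent direction, so leave parameters unchanged.
        return list(params)
    # Move a distance rho along the normalized gradient (ascent direction).
    perturbed = [p + rho * gi / norm for p, gi in zip(params, g)]
    # Second gradient evaluation, at the perturbed parameters.
    g_sam = grad_fn(perturbed)
    # Descend from the ORIGINAL parameters using the perturbed gradient.
    return [p - lr * gi for p, gi in zip(params, g_sam)]

# Hypothetical toy loss L(w) = 0.5 * (10 * w0^2 + w1^2), sharper along w0.
def toy_loss(w):
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)

def toy_grad(w):
    return [10.0 * w[0], w[1]]

w = [1.0, 1.0]
for _ in range(50):
    w = sam_step(w, toy_grad, lr=0.05, rho=0.05)
```

Each step evaluates the gradient twice, which is where the "twice the compute budget" in the abstract comes from when SAM is compared against a single-gradient optimizer such as AdamW.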

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.