LG · AI · Jul 1, 2025

Beyond First-Order: Training LLMs with Stochastic Conjugate Subgradients and AdamW

arXiv:2507.01241v1 · 3 citations · h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses optimization bottlenecks faced by researchers and practitioners training LLMs, though it appears incremental because it builds on existing first-order methods.

The paper tackles the performance limitations of stochastic gradient descent (SGD) in training large language models (LLMs) by proposing a stochastic conjugate subgradient method with adaptive sampling, which achieves faster convergence per iteration and improved scalability compared to traditional SGD techniques.

Stochastic gradient-based descent (SGD) methods have long been central to training large language models (LLMs). However, their effectiveness is increasingly being questioned, particularly in large-scale applications where empirical evidence suggests potential performance limitations. In response, this paper proposes a stochastic conjugate subgradient method together with adaptive sampling tailored specifically for training LLMs. The method not only achieves faster convergence per iteration but also demonstrates improved scalability compared to traditional SGD techniques. It leverages sample complexity analysis to adaptively choose the sample size, employs a stochastic conjugate subgradient approach to determine search directions, and uses an AdamW-like algorithm to adaptively adjust step sizes. This approach preserves the key advantages of first-order methods while effectively addressing the nonconvexity and non-smoothness inherent in LLM training. Additionally, we provide a detailed analysis of the advantage of the algorithm. Experimental results show that the proposed method not only maintains, but in many cases surpasses, the scalability of traditional SGD techniques, significantly enhancing both the speed and accuracy of the optimization process.
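
The three ingredients named in the abstract (adaptive sample size, a conjugate subgradient search direction, and an AdamW-like step) can be illustrated with a toy sketch. This is not the paper's algorithm: the abstract does not give the update rules, so every concrete choice here is an assumption made for illustration. The direction is a Wolfe-style minimum-norm combination of the previous direction and the new subgradient, the sample size grows under a variance-based "norm test" heuristic rather than the paper's sample complexity analysis, a periodic restart is added for robustness, and the objective is a small nonsmooth least-absolute-deviation regression instead of an LLM loss. All function names and parameters (`train`, `theta`, `restart_every`, and so on) are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def min_norm_combination(agg, g):
    """Shortest vector on the segment between agg and g (assumed Wolfe-style rule)."""
    diff = [gi - ai for gi, ai in zip(g, agg)]
    denom = dot(diff, diff)
    if denom == 0.0:
        return list(g)
    lam = min(1.0, max(0.0, dot(diff, g) / denom))  # weight placed on agg
    return [lam * ai + (1.0 - lam) * gi for ai, gi in zip(agg, g)]


def batch_subgradient(w, batch):
    """Mean subgradient of |x.w - y| over the batch, plus total sample variance."""
    per_sample = []
    for x, y in batch:
        r = dot(x, w) - y
        s = (r > 0) - (r < 0)  # sign(r); 0 is a valid subgradient at r == 0
        per_sample.append([s * xi for xi in x])
    n = len(batch)
    g = [sum(col) / n for col in zip(*per_sample)]
    sq_dev = sum((p - m) ** 2 for ps in per_sample for p, m in zip(ps, g))
    return g, sq_dev / max(n - 1, 1)


def mean_abs_loss(w, data):
    return sum(abs(dot(x, w) - y) for x, y in data) / len(data)


def train(data, dim, steps=300, batch_size=16, lr=0.05, betas=(0.9, 0.999),
          eps=1e-8, weight_decay=0.01, theta=0.5, restart_every=10, seed=0):
    rng = random.Random(seed)
    b1, b2 = betas
    w, m, v = [0.0] * dim, [0.0] * dim, [0.0] * dim
    agg = None  # aggregated subgradient; the search direction is -agg
    for t in range(1, steps + 1):
        g, var = batch_subgradient(w, rng.sample(data, batch_size))
        # Adaptive sampling (assumed heuristic): double the batch for later
        # steps when sampling noise dominates the subgradient norm.
        if var / batch_size > theta ** 2 * dot(g, g):
            batch_size = min(2 * batch_size, len(data))
        # Conjugate subgradient direction with a periodic restart (assumption).
        if agg is None or t % restart_every == 0:
            agg = g
        else:
            agg = min_norm_combination(agg, g)
        # AdamW-like step: moment estimates of agg, decoupled weight decay.
        for i in range(dim):
            m[i] = b1 * m[i] + (1 - b1) * agg[i]
            v[i] = b2 * v[i] + (1 - b2) * agg[i] ** 2
            m_hat = m[i] / (1 - b1 ** t)
            v_hat = v[i] / (1 - b2 ** t)
            w[i] -= lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w[i])
    return w, batch_size


def make_toy_data(n=512, seed=1):
    """Synthetic regression data; the target weights are arbitrary."""
    rng = random.Random(seed)
    w_true = [1.0, -2.0, 0.5]
    data = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in w_true]
        data.append((x, dot(x, w_true) + 0.1 * rng.gauss(0, 1)))
    return data
```

Calling `train(make_toy_data(), dim=3)` returns the fitted weights and the final batch size. The minimum-norm vector `agg` always satisfies `dot(agg, g) >= dot(agg, agg)`, so `-agg` never points uphill with respect to the current batch subgradient. Its norm can only shrink between restarts, which is why this sketch resets it periodically.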

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
