LG · Apr 9

SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization

arXiv:2604.07663 · 56.6

AI Analysis

This paper addresses the optimizer-state memory cost of large language model training and offers a practical improvement over existing methods.

The paper tackles the memory bottleneck of AdamW in LLM pretraining by proposing SAGE, a novel optimizer that resolves the embedding layer dilemma in hybrid designs, achieving new state-of-the-art perplexity on Llama models up to 1.3B parameters while reducing optimizer state memory.

The AdamW optimizer, while standard for LLM pretraining, is a critical memory bottleneck, consuming optimizer states equivalent to twice the model's size. Although light-state optimizers like SinkGD attempt to address this issue, we identify the embedding layer dilemma: these methods fail to handle the sparse, high-variance gradients inherent to embeddings, forcing a hybrid design that reverts to AdamW and partially negates the memory gains. We propose SAGE (Sign Adaptive GradiEnt), a novel optimizer that resolves this dilemma by replacing AdamW in this hybrid structure. SAGE combines a Lion-style update direction with a new, memory-efficient $O(d)$ adaptive scale. This scale acts as a "safe damper," provably bounded by 1.0, which tames high-variance dimensions more effectively than existing methods. This superior stability allows SAGE to achieve better convergence. On Llama models up to 1.3B parameters, our SAGE-based hybrid achieves new state-of-the-art perplexity, outperforming all baselines, including SinkGD hybrid, while significantly reducing optimizer state memory.
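To make the memory claim and the "Lion-style update direction" concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not give the formula for SAGE's $O(d)$ adaptive scale, so that scale is deliberately left out. The function names, the 4-byte (fp32) state assumption, and the hyperparameter defaults are illustrative assumptions; the update rule is the standard published Lion rule without weight decay.

```python
def optimizer_state_bytes(n_params, states_per_param, bytes_per_value=4):
    """Bytes of optimizer state for n_params parameters.

    Assumption: every state value is stored in fp32 (4 bytes).
    """
    return n_params * states_per_param * bytes_per_value


def lion_style_step(param, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99):
    """One Lion-style update on plain Python lists (no weight decay).

    Only the sign of the interpolated momentum is used as the direction,
    so a single momentum buffer is the only per-parameter state.
    SAGE's adaptive scale is NOT modelled here.
    """
    new_param, new_momentum = [], []
    for p, g, m in zip(param, grad, momentum):
        c = beta1 * m + (1 - beta1) * g       # interpolate momentum and gradient
        direction = (c > 0) - (c < 0)         # sign(c) in {-1, 0, 1}
        new_param.append(p - lr * direction)
        new_momentum.append(beta2 * m + (1 - beta2) * g)
    return new_param, new_momentum


N_PARAMS = 1_300_000_000  # the 1.3B model size mentioned in the abstract

# AdamW keeps two states per parameter (first and second moment).
adamw_gb = optimizer_state_bytes(N_PARAMS, states_per_param=2) / 1e9
# A sign-momentum optimizer keeps one state per parameter.
single_state_gb = optimizer_state_bytes(N_PARAMS, states_per_param=1) / 1e9

print(f"AdamW states:        {adamw_gb:.1f} GB")
print(f"Single-state method: {single_state_gb:.1f} GB")
```

Under these assumptions, AdamW's two fp32 buffers for a 1.3B-parameter model take about 10.4 GB, which is the "twice the model's size" figure in the abstract, while a single momentum buffer takes about half of that. Any additional $O(d)$ scale vector would be small next to a per-parameter buffer.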
