LG, Apr 9

SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization

arXiv:2604.07663, 56.6
AI Analysis

The paper addresses the optimizer-state memory cost of large language model training and offers a practical improvement over existing light-state methods.

The paper tackles AdamW's memory bottleneck in LLM pretraining by proposing SAGE, an optimizer that resolves the embedding layer dilemma in hybrid designs. It reports state-of-the-art perplexity on Llama models up to 1.3B parameters while reducing optimizer state memory.

The AdamW optimizer, while standard for LLM pretraining, is a critical memory bottleneck, consuming optimizer states equivalent to twice the model's size. Although light-state optimizers like SinkGD attempt to address this issue, we identify the embedding layer dilemma: these methods fail to handle the sparse, high-variance gradients inherent to embeddings, forcing a hybrid design that reverts to AdamW and partially negates the memory gains. We propose SAGE (Sign Adaptive GradiEnt), a novel optimizer that resolves this dilemma by replacing AdamW in this hybrid structure. SAGE combines a Lion-style update direction with a new, memory-efficient $O(d)$ adaptive scale. This scale acts as a "safe damper," provably bounded by 1.0, which tames high-variance dimensions more effectively than existing methods. This superior stability allows SAGE to achieve better convergence. On Llama models up to 1.3B parameters, our SAGE-based hybrid achieves new state-of-the-art perplexity, outperforming all baselines, including SinkGD hybrid, while significantly reducing optimizer state memory.
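The memory argument in the abstract can be made concrete with a rough back-of-the-envelope sketch. The code below is not from the paper. It assumes fp32 optimizer states, that the light-state optimizer used on non-embedding weights keeps no per-parameter state, and that a SAGE-like optimizer keeps one momentum buffer (as Lion does) plus a length-d scale vector for the embedding matrix. The vocabulary size (32000) and embedding width (2048) are hypothetical values chosen only for illustration, and the function name is made up for this example.

```python
def optimizer_state_bytes(n_total_params, vocab_size, embed_dim, bytes_per_value=4):
    """Rough optimizer-state sizes in bytes for three setups.

    Assumptions (not taken from the paper):
      - states are stored in fp32 (bytes_per_value=4);
      - the light-state optimizer on non-embedding weights is treated as stateless;
      - a SAGE-like optimizer stores one momentum buffer plus an O(d) scale vector.
    """
    n_embed_params = vocab_size * embed_dim

    # AdamW everywhere: first and second moment for every parameter (2x model size).
    adamw_full = 2 * n_total_params * bytes_per_value

    # Hybrid with AdamW only on embeddings: two moments per embedding parameter.
    adamw_hybrid = 2 * n_embed_params * bytes_per_value

    # Hybrid with a SAGE-like optimizer on embeddings: momentum + length-d scale.
    sage_like_hybrid = (n_embed_params + embed_dim) * bytes_per_value

    return {
        "adamw_full": adamw_full,
        "adamw_hybrid": adamw_hybrid,
        "sage_like_hybrid": sage_like_hybrid,
    }


if __name__ == "__main__":
    # Hypothetical 1.3B-parameter model with a 32000 x 2048 embedding matrix.
    sizes = optimizer_state_bytes(1_300_000_000, vocab_size=32000, embed_dim=2048)
    for name, nbytes in sizes.items():
        print(f"{name}: {nbytes / 1e9:.3f} GB")
```

Under these assumptions, the embedding-side state of the hybrid roughly halves when the two AdamW moments are replaced by a single momentum buffer, because the O(d) scale is negligible next to the vocabulary-sized buffers. The actual savings reported in the paper depend on its real state layout and precision.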

