LG | OC | ML | May 30, 2025

GradPower: Powering Gradients for Faster Language Model Pre-Training

arXiv:2505.24275v1 | 5 citations | h-index: 12
Originality: Incremental advance
AI Analysis

This work addresses the computational bottleneck in pre-training large language models, offering a lightweight, easy-to-implement way to train faster and more efficiently.

The paper tackles the problem of slow language model pre-training by introducing GradPower, a gradient-transformation technique that accelerates training and consistently achieves lower terminal loss across diverse architectures, datasets, and learning-rate schedules, with notable gains in mixture-of-experts models.

We propose GradPower, a lightweight gradient-transformation technique for accelerating language model pre-training. Given a gradient vector $g=(g_i)_i$, GradPower first applies the elementwise sign-power transformation: $\varphi_p(g)=({\rm sign}(g_i)|g_i|^p)_{i}$ for a fixed $p>0$, and then feeds the transformed gradient into a base optimizer. Notably, GradPower requires only a single-line code change and no modifications to the base optimizer's internal logic, including the hyperparameters. When applied to Adam (termed AdamPower), GradPower consistently achieves lower terminal loss across diverse architectures (LLaMA, Qwen2MoE), parameter scales (66M to 2B), datasets (C4, OpenWebText), and learning-rate schedules (cosine, warmup-stable-decay). The most pronounced gains are observed when training modern mixture-of-experts models with warmup-stable-decay schedules. GradPower also integrates seamlessly with other state-of-the-art optimizers, such as Muon, yielding further improvements. Finally, we provide theoretical analyses that reveal the underlying mechanism of GradPower and highlight the influence of gradient noise.
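
The transformation in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names `sign_power` and `adam_step`, the toy list-based Adam update, and the value `p=1.2` are hypothetical choices made only for illustration. The sketch shows the elementwise map $\varphi_p$ and the fact that the base optimizer is left untouched; only the gradient passed to it changes.

```python
import math


def sign_power(grad, p):
    """Elementwise sign-power transform: sign(g_i) * |g_i|**p for a fixed p > 0."""
    if p <= 0:
        raise ValueError("p must be positive")
    # copysign(x, y) returns x with the sign of y, so zeros stay zero.
    return [math.copysign(abs(g) ** p, g) for g in grad]


def adam_step(params, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update on lists of floats (toy base optimizer, assumed)."""
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (w, g) in enumerate(zip(params, grad)):
        # Exponential moving averages of the gradient and its square.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias-corrected moment estimates.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        new_params.append(w - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params


# Usage: the only change relative to plain Adam is wrapping the gradient.
params = [0.5, -1.0, 2.0]
state = {"t": 0, "m": [0.0] * 3, "v": [0.0] * 3}
grad = [0.25, -4.0, 0.0]
params = adam_step(params, sign_power(grad, p=1.2), state)  # p=1.2 is illustrative
```

With `p = 1` the transform is the identity, so the wrapped call reduces to the base optimizer; `p > 1` shrinks entries with magnitude below 1 and enlarges those above 1, while `p < 1` does the opposite.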

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
