LG · Mar 6

Omni-Masked Gradient Descent: Memory-Efficient Optimization via Mask Traversal with Improved Convergence

Hui Yang, Tao Ren, Jinyang Jiang, Wan Tian, Yijie Peng

arXiv:2603.05960v1 · 7.3 · h-index: 15

Predicted impact top 61% in LG · last 90 days · Originality: Incremental advance

AI Analysis

This work addresses the GPU-memory bottleneck in scaling full-parameter training of large language models and represents an incremental improvement over existing methods.

The paper tackles the problem of memory-efficient optimization for large language model training by proposing Omni-Masked Gradient Descent (OMGD), which achieves a strictly improved iteration complexity of $\tilde{\mathcal{O}}(\epsilon^{-3})$ for finding an $\epsilon$-approximate stationary point in nonconvex settings.

Memory-efficient optimization methods have recently gained increasing attention for scaling full-parameter training of large language models under the GPU-memory bottleneck. Existing approaches either lack clear convergence guarantees or achieve only the standard $\mathcal{O}(\epsilon^{-4})$ iteration complexity in nonconvex settings. We propose Omni-Masked Gradient Descent (OMGD), an optimization method based on mask traversal for memory-efficient training, and provide a nonconvex convergence analysis that establishes a strictly improved iteration complexity of $\tilde{\mathcal{O}}(\epsilon^{-3})$ for finding an $\epsilon$-approximate stationary point. Empirically, OMGD is a lightweight, plug-and-play approach that integrates seamlessly into most mainstream optimizers, yielding consistent improvements over competitive baselines in both fine-tuning and pre-training tasks.
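The abstract does not spell out the update rule, so the following is only a toy sketch of the general idea of masked gradient descent with mask traversal, not the authors' OMGD algorithm. It assumes the masks are disjoint coordinate blocks and that every mask is visited exactly once per pass in a shuffled order. The names `masked_gd`, `quad_grad`, `num_blocks`, and the quadratic test objective are hypothetical and chosen for illustration. For simplicity the toy computes the full gradient at every step; a memory-efficient implementation would presumably compute and store gradients and optimizer state only for the active block.

```python
import random


def masked_gd(grad, x0, num_blocks, lr, epochs, seed=0):
    """Toy masked gradient descent with mask traversal (illustrative only)."""
    rng = random.Random(seed)
    x = list(x0)
    n = len(x)
    # Assumption: masks are disjoint coordinate blocks that together cover x.
    blocks = [list(range(b, n, num_blocks)) for b in range(num_blocks)]
    for _ in range(epochs):
        order = list(range(num_blocks))
        rng.shuffle(order)  # traverse every mask exactly once per pass
        for b in order:
            g = grad(x)  # full gradient here; only the masked entries are used
            for i in blocks[b]:
                x[i] -= lr * g[i]  # update only the coordinates in the mask
    return x


def quad_grad(x):
    # Gradient of the hypothetical objective f(x) = 0.5 * sum((i + 1) * x_i**2).
    return [(i + 1) * xi for i, xi in enumerate(x)]


x_final = masked_gd(quad_grad, [1.0] * 8, num_blocks=4, lr=0.1, epochs=200)
grad_norm = sum(g * g for g in quad_grad(x_final)) ** 0.5
```

Because the blocks partition the coordinates and each pass visits all of them, every parameter is updated once per pass, which is the property that distinguishes traversal from sampling masks independently at each step.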
