LG AIApr 29, 2025

GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection

DiJia Su, Andrew Gu, Jane Xu, Yuandong Tian, Jiawei Zhao

arXiv:2504.20437v121.312 citationsh-index: 6

Originality Incremental advance

AI Analysis

This work addresses memory efficiency for researchers and practitioners training large language models, representing an incremental improvement over prior methods.

The paper tackles memory bottlenecks in large language model pre-training by introducing GaLore 2, an efficient framework that addresses computational overhead and integration challenges, enabling pre-training of Llama 7B with up to 500 billion tokens.

Large language models (LLMs) have revolutionized natural language understanding and generation but face significant memory bottlenecks during training. GaLore, Gradient Low-Rank Projection, addresses this issue by leveraging the inherent low-rank structure of weight gradients, enabling substantial memory savings without sacrificing performance. Recent works further extend GaLore from various aspects, including low-bit quantization and higher-order tensor structures. However, there are several remaining challenges for GaLore, such as the computational overhead of SVD for subspace updates and the integration with state-of-the-art training parallelization strategies (e.g., FSDP). In this paper, we present GaLore 2, an efficient and scalable GaLore framework that addresses these challenges and incorporates recent advancements. In addition, we demonstrate the scalability of GaLore 2 by pre-training Llama 7B from scratch using up to 500 billion training tokens, highlighting its potential impact on real LLM pre-training scenarios.

View on arXiv PDF

Similar