SubTrack++ : Gradient Subspace Tracking for Scalable LLM Training
This work addresses the need for democratizing LLMs by improving scalability for researchers and practitioners, though it appears incremental as it builds on prior subspace tracking methods.
The paper targets the resource cost of training large language models (LLMs) and proposes SubTrack++, which reduces pre-training wall-time by up to 65% and fine-tuning time by 36% compared to existing methods while maintaining the same memory footprint.
Training large language models (LLMs) is highly resource-intensive due to their massive number of parameters and the overhead of optimizer states. While recent work has aimed to reduce memory consumption, such efforts often entail trade-offs among memory efficiency, training time, and model performance. Yet true democratization of LLMs requires simultaneous progress across all three dimensions. To this end, we propose SubTrack++, which combines Grassmannian gradient subspace tracking with projection-aware optimizers, enabling Adam's internal statistics to adapt to subspace changes. Additionally, recovery scaling, a technique that restores information lost through low-rank projections, further enhances model performance. Our method demonstrates state-of-the-art (SOTA) convergence by exploiting Grassmannian geometry, reducing pre-training wall-time by up to 65% and fine-tuning time by 36% compared to existing SOTA methods, while maintaining the same memory footprint.
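The low-rank projection that the abstract builds on can be illustrated with a small sketch. It is not the authors' implementation: it omits the Grassmannian subspace update, the projection-aware adjustment of Adam's statistics, and the recovery-scaling rule. It only shows, for a fixed orthonormal basis S, the projected gradient S^T G that a low-rank optimizer would operate on, the component G - S S^T G that the projection discards (the information recovery scaling aims to restore), and a rough count of stored optimizer values. The helper names, the toy matrices, and the storage layout (two r x n moment buffers plus an m x r basis, as in GaLore-style methods) are assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def project(basis, grad):
    """Low-rank view of the gradient: S^T G, with shape (r, n)."""
    return matmul(transpose(basis), grad)


def lost_component(basis, grad):
    """Part of G outside span(S): G - S S^T G, with shape (m, n)."""
    back = matmul(basis, project(basis, grad))
    return [[g - b for g, b in zip(g_row, b_row)]
            for g_row, b_row in zip(grad, back)]


def adam_state_elements(m, n, rank=None):
    """Stored optimizer values for one m x n weight matrix.

    Full Adam keeps two m x n moment buffers. The low-rank layout
    assumed here keeps two r x n buffers plus the m x r basis.
    """
    if rank is None:
        return 2 * m * n
    return 2 * rank * n + m * rank


# Hypothetical toy problem: a 3 x 2 gradient and a rank-1 subspace
# spanned by the first coordinate axis.
S = [[1.0], [0.0], [0.0]]
G = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

G_low = project(S, G)            # what the low-rank optimizer sees
residual = lost_component(S, G)  # what the projection throws away

full_states = adam_state_elements(4096, 4096)
low_rank_states = adam_state_elements(4096, 4096, rank=128)
```

In this toy case the projection keeps only the first row of G, and the other two rows end up in the residual. Under the assumed layout, a 4096 x 4096 weight with rank 128 needs about 4.7% of the optimizer state values that full Adam stores.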