GaLore$+$: Boosting Low-Rank Adaptation for LLMs with Cross-Head Projection
This work addresses a training-time bottleneck in low-rank training methods for LLMs, namely the cost of estimating low-rank projections, and offers faster fine-tuning for researchers and practitioners. The contribution is incremental, since it builds directly on GaLore.
The paper tackles the time-consuming low-rank projection estimation in GaLore, a memory-efficient method for optimizing large language models (LLMs). It proposes GaLore$+$, which combines cross-head projection with randomized SVD and achieves approximately $4\times$ the fine-tuning speed of vanilla GaLore while delivering superior performance on arithmetic reasoning and natural language generation tasks.
Recent low-rank training methods, such as GaLore, have significantly reduced the memory required to optimize large language models (LLMs). However, these methods often spend much of their time estimating the low-rank projections. In particular, the singular value decomposition (SVD) in GaLore can consume more than 80\% of the total training time. To address this issue, we propose GaLore$+$, which uses cross-head low-rank projection to reduce the time spent estimating low-rank projections for multi-head attention. In addition, we employ randomized subspace iteration to achieve fast SVD. To further enhance performance, we propose sparsely coded residuals, which reduce the errors that low-rank approximation introduces into the first- and second-order moments of the optimizers and into the weight updates. We evaluate GaLore$+$ on arithmetic reasoning and natural language generation datasets. Our experiments demonstrate that GaLore$+$ delivers superior performance while achieving approximately $4\times$ the fine-tuning speed of vanilla GaLore.
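To make the fast-SVD step concrete, the following is a minimal NumPy sketch of generic randomized subspace iteration for estimating the top singular directions of a gradient matrix. It is an illustration under stated assumptions, not the GaLore$+$ implementation: the function name `randomized_left_subspace` and the parameters `n_iter`, `oversample`, and `seed` are hypothetical, and the defaults are arbitrary.

```python
import numpy as np

def randomized_left_subspace(grad, rank, n_iter=2, oversample=4, seed=0):
    """Approximate the top-`rank` left singular vectors and singular values
    of `grad` with randomized subspace iteration (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m, n = grad.shape
    k = min(rank + oversample, m, n)  # slightly wider sketch for accuracy

    # Sketch the column space of grad with a Gaussian test matrix.
    omega = rng.standard_normal((n, k))
    q, _ = np.linalg.qr(grad @ omega)

    # Subspace (power) iterations sharpen the estimate; the QR steps
    # re-orthonormalize the basis to keep the iteration stable.
    for _ in range(n_iter):
        z, _ = np.linalg.qr(grad.T @ q)
        q, _ = np.linalg.qr(grad @ z)

    # Exact SVD of the small (k x n) projected matrix instead of the full one.
    b = q.T @ grad
    u_small, s, _ = np.linalg.svd(b, full_matrices=False)
    return (q @ u_small)[:, :rank], s[:rank]

# Hypothetical usage: project a gradient onto the estimated subspace.
rng = np.random.default_rng(1)
g = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 48))  # rank-8 matrix
p, s = randomized_left_subspace(g, rank=8)
low_rank_grad = p.T @ g  # shape (8, 48): the compact gradient an optimizer would see
```

The costly decomposition is applied only to a $k \times n$ matrix with $k \ll m$, which is where the saving over a full SVD of the $m \times n$ gradient comes from. How GaLore$+$ shares the resulting projection across attention heads is not shown here.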