From Low Rank Gradient Subspace Stabilization to Low-Rank Weights: Observations, Theories, and Applications
This work addresses the high memory and compute cost of LLMs for practitioners by providing a data-agnostic compression and fine-tuning method. The contribution is incremental: it builds on existing low-rank compression ideas and adds a new gradient-dynamics perspective.
The paper tackles the problem of compressing Large Language Models (LLMs) by analyzing the low-rank properties of weight matrices through gradient subspace stabilization, revealing that different components require variable rank reduction to minimize performance loss. It presents WeLore, a method that unifies compression and memory-efficient fine-tuning, achieving results that closely mimic or outperform full fine-tuning with reduced memory and compute requirements.
The weight matrices of Large Language Models (LLMs) can often be expressed in low-rank form, with the potential to relax memory and compute resource requirements. Unlike prior efforts that focus on developing novel matrix decompositions, in this work we study the non-uniform low-rank properties of weight matrices in LLMs through the lens of gradient subspace stabilization. First, we provide a theoretical framework to understand the stabilization of gradient subspaces through Hessian analysis. Second, we empirically establish an important relationship between gradient dynamics and the low-rank expressiveness of weight matrices. Our findings reveal that different LLM components exhibit varying levels of converged low-rank structure, necessitating variable rank reduction across them to minimize the drop in performance due to compression. Drawing on this result, we present Weight Low-Rank Projection (WeLore), which unifies weight compression and memory-efficient fine-tuning in a data-agnostic and one-shot manner. When used as a compression technique, WeLore categorizes weight matrices into Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs) and suitably encodes them for minimum performance loss. Our gradient dynamics perspective illustrates that LRCs tend to have better fine-tuning capabilities, and that fine-tuning them alone can closely mimic, and sometimes outperform, the training loss trajectory and performance of full fine-tuning with a notable reduction in memory and compute footprint. Code is available at https://github.com/VITA-Group/WeLore.
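
To make the idea of variable rank reduction concrete, the following is a minimal NumPy sketch, not the WeLore implementation. It assumes a simple spectral-energy criterion for choosing each matrix's rank and a parameter-count test for deciding whether a matrix is worth storing in factored form. The function names, the `energy` parameter, and both criteria are illustrative assumptions that do not come from the abstract; the paper's actual rank-selection rule may differ.

```python
import numpy as np


def smallest_rank_for_energy(singular_values, energy):
    """Smallest k such that the top-k squared singular values hold `energy` of the total."""
    squared = singular_values ** 2
    cumulative = np.cumsum(squared) / np.sum(squared)
    k = int(np.searchsorted(cumulative, energy)) + 1
    # Clamp in case rounding leaves cumulative[-1] slightly below `energy`.
    return min(k, len(singular_values))


def low_rank_factors(weight, energy):
    """Truncated-SVD factors (a, b) with a @ b approximating `weight`."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    k = smallest_rank_for_energy(s, energy)
    a = u[:, :k] * s[:k]  # shape (m, k), singular values folded into the left factor
    b = vt[:k, :]         # shape (k, n)
    return a, b


def is_low_rank_component(weight, energy):
    """Hypothetical LRC test: factored storage must use fewer parameters than dense."""
    m, n = weight.shape
    a, _ = low_rank_factors(weight, energy)
    k = a.shape[1]
    return k * (m + n) < m * n
```

Under these assumptions, a matrix whose spectrum decays quickly is kept as two thin factors, while a matrix with a flat spectrum is left dense. Applying the test to each matrix separately yields a different rank per matrix, which is the sense in which the rank reduction is non-uniform.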