GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters
GaLLoP addresses the challenge of efficiently adapting large language models to downstream tasks while mitigating catastrophic forgetting. The contribution is incremental, since it builds on existing sparse fine-tuning methods.
The paper tackles the problem of selecting which parameters to update during sparse fine-tuning of LLMs. It introduces GaLLoP, which fine-tunes only the parameters with the largest gradient magnitudes on the downstream task and the smallest pre-trained magnitudes. With LLaMA3 8B and Gemma 2B as base models, GaLLoP improves on or matches the performance of leading techniques such as LoRA and DoRA.
Sparse fine-tuning techniques adapt LLMs to downstream tasks by tuning only a sparse subset of model parameters. However, the effectiveness of sparse adaptation depends on optimally selecting the model parameters to be fine-tuned. In this work, we introduce a novel sparse fine-tuning technique named GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters, which fine-tunes only those model parameters that have the largest gradient magnitudes on downstream tasks and the smallest pre-trained magnitudes, intuitively prioritizing parameters that are highly task-relevant but minimally disruptive to pre-trained knowledge. Our experiments with LLaMA3 8B and Gemma 2B as base models show that GaLLoP consistently improves on or matches the in-distribution and out-of-distribution performance of other leading parameter-efficient fine-tuning techniques, including LoRA, DoRA, and SAFT. Our analysis demonstrates that GaLLoP mitigates catastrophic forgetting and memorization of task data, since important pre-trained parameters remain unchanged, and that it stabilizes performance relative to other fine-tuning techniques, generalizing robustly across most random seeds.
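The selection rule described above can be illustrated with a minimal NumPy sketch. The abstract does not state how the two criteria are combined, so the ratio score |gradient| / (|weight| + eps) used here is an assumption, as are the function names `select_sparse_mask` and `masked_sgd_step`, the `density` and `eps` parameters, and the plain SGD update. It is meant only to show the idea of favoring parameters with large task gradients and small pre-trained magnitudes, not the paper's actual implementation.

```python
import numpy as np


def select_sparse_mask(weights, grads, density=0.01, eps=1e-8):
    """Return a boolean mask marking the parameters chosen for fine-tuning.

    Assumed scoring rule (not taken from the source): |grad| / (|weight| + eps),
    which is high when the gradient is large and the pre-trained weight is small.
    """
    score = np.abs(grads) / (np.abs(weights) + eps)
    # Number of parameters to tune; always keep at least one.
    k = max(1, int(round(density * weights.size)))
    flat = score.ravel()
    # Indices of the k highest scores (order among them is not needed).
    top_idx = np.argpartition(flat, -k)[-k:]
    mask = np.zeros(weights.size, dtype=bool)
    mask[top_idx] = True
    return mask.reshape(weights.shape)


def masked_sgd_step(weights, grads, mask, lr=1e-3):
    """Apply one SGD step only to the selected parameters; others stay frozen."""
    return weights - lr * grads * mask


# Toy example: four parameters, tune one quarter of them.
weights = np.array([[0.01, 2.0], [0.5, 0.02]])
grads = np.array([[1.0, 1.0], [0.1, 0.001]])
mask = select_sparse_mask(weights, grads, density=0.25)
updated = masked_sgd_step(weights, grads, mask, lr=0.1)
```

In the toy example, the entry with a large gradient and a small pre-trained value is the only one selected, while the entry with an equally large gradient but a large pre-trained value stays frozen. This mirrors the intuition in the abstract that important pre-trained parameters remain unchanged.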