LG | Feb 1, 2025

Sparse Gradient Compression for Fine-Tuning Large Language Models

David H. Yang, Mohammad Mohammadi Amiri, Tejaswini Pedapati, Subhajit Chaudhury, Pin-Yu Chen

arXiv:2502.00311v111.44 citations | h-index: 13 | Has Code

Originality: Incremental advance

AI Analysis

The work addresses memory-efficiency challenges for researchers and practitioners fine-tuning large models and offers a flexible trade-off between memory use and performance. It is an incremental advance, since it builds on parameter-efficient fine-tuning methods.

The paper tackles the high memory costs of fine-tuning large language models by proposing sparse gradient compression, which reduces optimizer state memory usage more effectively than existing methods, achieving superior performance on downstream tasks with substantial memory savings.

Fine-tuning large language models (LLMs) for downstream tasks has become increasingly crucial due to their widespread use and the growing availability of open-source models. However, the high memory costs associated with fine-tuning remain a significant challenge, especially as models increase in size. To address this, parameter-efficient fine-tuning (PEFT) methods have been proposed to minimize the number of parameters required for fine-tuning LLMs. However, these approaches often tie the number of optimizer states to the dimensions of the model parameters, limiting flexibility and control during fine-tuning. In this paper, we propose sparse gradient compression (SGC), a training regime designed to address these limitations. Our approach leverages inherent sparsity in gradients to compress optimizer states by projecting them onto a low-dimensional subspace, with dimensionality independent of the original model's parameters. By enabling optimizer state updates in an arbitrary low-dimensional subspace, SGC offers a flexible tradeoff between memory efficiency and performance. We demonstrate through experiments that SGC can decrease memory usage in optimizer states more effectively than existing PEFT methods. Furthermore, by fine-tuning LLMs on various downstream tasks, we show that SGC can deliver superior performance while substantially lowering optimizer state memory requirements, particularly in data-limited and memory-limited settings.
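The idea in the abstract (sparsify the gradient, project it into a small subspace, and keep optimizer states only in that subspace) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `top_s_sparsify` and `CompressedAdamState`, the fixed random Gaussian projection, and the use of the projection's transpose to map the update back to parameter space are all assumptions made for illustration. The point it shows is that the Adam moments have length `k`, which is chosen freely and does not depend on the parameter count `d`.

```python
import math
import random


def top_s_sparsify(grad, s):
    """Keep the s largest-magnitude entries of grad and zero out the rest."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:s])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


class CompressedAdamState:
    """Adam moments stored in a k-dimensional subspace (hypothetical helper).

    k is a free choice and is independent of the parameter count d.
    """

    def __init__(self, d, k, seed=0, beta1=0.9, beta2=0.999, eps=1e-8):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(k)
        # Assumption: a fixed random Gaussian projection with k rows and d columns.
        # It is stored here for simplicity; it could be regenerated from the seed.
        self.P = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]
        self.m = [0.0] * k  # first moment, length k rather than d
        self.v = [0.0] * k  # second moment, length k rather than d
        self.t = 0
        self.beta1, self.beta2, self.eps = beta1, beta2, eps

    def step(self, grad, s):
        """Return a d-dimensional update direction computed from k-dimensional states."""
        sparse = top_s_sparsify(grad, s)
        nonzero = [(j, g) for j, g in enumerate(sparse) if g != 0.0]
        # Project the sparse gradient into the subspace: low = P @ sparse.
        low = [sum(row[j] * g for j, g in nonzero) for row in self.P]

        # Standard Adam moment updates with bias correction, applied in k dimensions.
        self.t += 1
        b1, b2 = self.beta1, self.beta2
        self.m = [b1 * m + (1 - b1) * g for m, g in zip(self.m, low)]
        self.v = [b2 * v + (1 - b2) * g * g for v, g in zip(self.v, low)]
        c1 = 1 - b1 ** self.t
        c2 = 1 - b2 ** self.t
        u = [(m / c1) / (math.sqrt(v / c2) + self.eps) for m, v in zip(self.m, self.v)]

        # Assumption: map back with the transpose (P^T @ u), a simple stand-in
        # for whatever recovery step a real method would use.
        return [sum(self.P[i][j] * u[i] for i in range(len(u))) for j in range(len(grad))]
```

In this sketch, calling `CompressedAdamState(d=200, k=8).step(grad, s=20)` keeps two moment vectors of length 8 for a 200-dimensional gradient. Raising `k` spends more memory on optimizer state, which is the kind of memory versus performance trade-off the abstract describes.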
