EBFT: Effective and Block-Wise Fine-Tuning for Sparse LLMs
This work gives researchers and practitioners an efficient way to fine-tune sparse LLMs, reducing computational cost while improving performance. The contribution is incremental, since it builds on existing fine-tuning and sparsity techniques.
The paper tackles resource-intensive and suboptimal fine-tuning of sparse large language models (LLMs) by proposing EBFT, an efficient framework that minimizes reconstruction error block by block. On Wikitext2 with LlamaV1-7B at 70% sparsity, EBFT reaches a perplexity of 16.88, compared with 75.14 for DSnoT. At a structured sparsity ratio of 26%, it reaches 16.27, compared with 16.44 for LoRA. Fine-tuning LlamaV1-7B takes about 30 minutes, and the whole framework runs on a single 16GB GPU.
Existing methods for fine-tuning sparse LLMs often suffer from resource-intensive requirements and high retraining costs. In addition, many fine-tuning methods rely on approximations or heuristic optimization strategies, which may lead to suboptimal solutions. To address these issues, we propose an efficient and fast framework for fine-tuning sparse LLMs based on minimizing reconstruction error. Our approach samples a small calibration dataset and uses backpropagation to iteratively minimize the reconstruction error one block at a time, aiming for optimal solutions. Extensive experiments on various benchmarks consistently demonstrate the superiority of our method over other baselines. For instance, on the Wikitext2 dataset with LlamaV1-7B at 70% sparsity, our proposed EBFT achieves a perplexity of 16.88, surpassing the state-of-the-art DSnoT, which has a perplexity of 75.14. Moreover, with a structured sparsity ratio of 26%, EBFT achieves a perplexity of 16.27, outperforming LoRA (perplexity 16.44). Furthermore, fine-tuning LlamaV1-7B with EBFT takes only approximately 30 minutes, and the entire framework can be executed on a single 16GB GPU. The source code is available at https://github.com/sunggo/EBFT.
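The block-by-block procedure described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes each "block" is a single linear map, uses magnitude pruning to build the masks, uses plain gradient descent with a hand-derived gradient in place of backpropagation through a transformer block, and assumes the target for each block is the dense block's output on the dense activations. All function names and hyperparameters below are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def apply_mask(w, mask):
    """Zero out pruned weights (mask entries are 0.0 or 1.0)."""
    return [[wij * mij for wij, mij in zip(wr, mr)] for wr, mr in zip(w, mask)]

def magnitude_mask(w, sparsity):
    """Keep the largest-magnitude weights (assumed pruning criterion)."""
    flat = sorted((abs(v) for row in w for v in row), reverse=True)
    keep = max(1, round(len(flat) * (1.0 - sparsity)))
    threshold = flat[keep - 1]
    return [[1.0 if abs(v) >= threshold else 0.0 for v in row] for row in w]

def recon_error(out, target):
    """Squared reconstruction error, averaged over calibration samples."""
    n = len(out)
    return sum((o - t) ** 2
               for orow, trow in zip(out, target)
               for o, t in zip(orow, trow)) / n

def finetune_block(x_in, target, w_init, mask, lr=0.05, steps=200):
    """Gradient descent on the reconstruction error of one block.

    Only unpruned weights are updated, so the sparsity pattern is preserved.
    """
    n = len(x_in)
    x_t = [list(col) for col in zip(*x_in)]
    w = [row[:] for row in w_init]
    for _ in range(steps):
        out = matmul(x_in, apply_mask(w, mask))
        resid = [[o - t for o, t in zip(orow, trow)]
                 for orow, trow in zip(out, target)]
        grad = matmul(x_t, resid)  # X^T (X W_masked - T)
        w = [[wij - lr * (2.0 / n) * g * m
              for wij, g, m in zip(wr, gr, mr)]
             for wr, gr, mr in zip(w, grad, mask)]
    return w

random.seed(0)
n_samples, dim, n_blocks = 32, 4, 3

# Small calibration set and a toy "dense model" made of linear blocks.
calib = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_samples)]
dense_blocks = [[[random.gauss(0, 0.3) for _ in range(dim)] for _ in range(dim)]
                for _ in range(n_blocks)]

x_dense, x_sparse = calib, calib
results, tuned_blocks, masks = [], [], []
for w_dense in dense_blocks:
    mask = magnitude_mask(w_dense, sparsity=0.5)
    w_pruned = apply_mask(w_dense, mask)
    target = matmul(x_dense, w_dense)  # dense block output (assumed target)
    before = recon_error(matmul(x_sparse, w_pruned), target)
    w_tuned = finetune_block(x_sparse, target, w_pruned, mask)
    after = recon_error(matmul(x_sparse, apply_mask(w_tuned, mask)), target)
    results.append((before, after))
    tuned_blocks.append(w_tuned)
    masks.append(mask)
    # Only one block is optimized at a time; activations are passed forward.
    x_dense = target
    x_sparse = matmul(x_sparse, apply_mask(w_tuned, mask))

for i, (before, after) in enumerate(results):
    print(f"block {i}: error before={before:.6f} after={after:.6f}")
```

Because each block is tuned in isolation against a fixed target, only that block's weights and the cached calibration activations need to be held during optimization, which fits the abstract's claim that the framework runs on a single 16GB GPU. The sketch does not model that memory behavior.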