Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning
Sparse MeZO addresses the memory cost of fine-tuning LLMs, which matters to both researchers and practitioners, and it offers a more effective alternative to existing zeroth-order methods. The contribution is incremental, since it builds on MeZO and PEFT concepts.
The paper tackles the performance drop in memory-efficient zeroth-order fine-tuning of large language models by introducing Sparse MeZO, which applies zeroth-order optimization only to a subset of parameters, resulting in a 9% absolute accuracy improvement and a 3.5x speedup over MeZO on the RTE task.
While fine-tuning large language models (LLMs) for specific tasks often yields impressive results, it comes at the cost of memory inefficiency due to back-propagation in gradient-based training. Memory-efficient Zeroth-order (MeZO) optimizers, recently proposed to address this issue, require only forward passes during training, making them more memory-friendly. However, the quality of gradient estimates in zeroth-order optimization often depends on the data dimensionality, potentially explaining why MeZO still exhibits significant performance drops compared to standard fine-tuning across various tasks. Inspired by the success of Parameter-Efficient Fine-Tuning (PEFT), this paper introduces Sparse MeZO, a novel memory-efficient zeroth-order optimization approach that applies ZO only to a carefully chosen subset of parameters. We propose a simple yet effective parameter selection scheme that yields significant performance gains with Sparse MeZO. Additionally, we develop a memory-optimized implementation for sparse masking, ensuring the algorithm requires only inference-level memory consumption and allowing Sparse MeZO to fine-tune LLaMA-30b on a single A100 GPU. Experimental results show that Sparse MeZO consistently improves both performance and convergence speed over MeZO without any overhead. For example, it achieves a 9% absolute accuracy improvement and a 3.5x speedup over MeZO on the RTE task.
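To make the idea of applying zeroth-order updates to only a subset of parameters concrete, the following is a minimal sketch and not the authors' implementation. It assumes a two-point (SPSA-style) gradient estimate, a toy quadratic loss, and a hand-written boolean mask. The names `sparse_zo_step`, `mask`, and `toy_loss` are hypothetical, and the abstract does not describe the paper's actual parameter selection scheme, so the mask here is a placeholder. The noise is regenerated from a seed rather than stored, which is one common way to keep memory close to inference level.

```python
import random


def sparse_zo_step(theta, mask, loss_fn, lr=0.05, eps=1e-3, seed=0):
    """One two-point zeroth-order step that touches only entries where mask[i] is True.

    Illustrative sketch: theta is a plain list of floats, modified in place.
    """

    def perturb(scale):
        # Re-create the same Gaussian noise from the seed instead of storing it.
        rng = random.Random(seed)
        for i in range(len(theta)):
            z = rng.gauss(0.0, 1.0)
            if mask[i]:
                theta[i] += scale * eps * z

    perturb(+1.0)                 # theta + eps * (mask * z)
    loss_plus = loss_fn(theta)
    perturb(-2.0)                 # theta - eps * (mask * z)
    loss_minus = loss_fn(theta)
    perturb(+1.0)                 # back to the original theta (up to rounding)

    # Scalar projected-gradient estimate from two forward passes.
    g = (loss_plus - loss_minus) / (2.0 * eps)

    # Apply the update along the same masked noise direction.
    rng = random.Random(seed)
    for i in range(len(theta)):
        z = rng.gauss(0.0, 1.0)
        if mask[i]:
            theta[i] -= lr * g * z
    return g


def toy_loss(params):
    # Stand-in for a model's forward-pass loss (assumption for the demo).
    return sum(p * p for p in params)


theta = [1.0, -2.0, 0.5, 3.0]
mask = [True, False, True, False]   # placeholder selection, not the paper's scheme
initial_loss = toy_loss(theta)

for step in range(200):
    sparse_zo_step(theta, mask, toy_loss, seed=step)

final_loss = toy_loss(theta)
print(initial_loss, final_loss, theta)
```

In this sketch the unmasked entries are never perturbed or updated, so the search direction lives in a lower-dimensional subspace. That is the intuition the abstract gives for why restricting zeroth-order optimization to fewer parameters could improve the quality of the gradient estimate.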