LGCLFeb 15, 2024

BitDelta: Your Fine-Tune May Only Be Worth One Bit

arXiv:2402.10193v345 citationsh-index: 15NIPS
Originality Incremental advance
AI Analysis

This enables efficient multi-tenant serving and storage of fine-tuned models, though it is incremental as it builds on existing quantization and fine-tuning methods.

The paper tackles the problem of high computational and memory costs in serving multiple fine-tuned large language models by showing that the fine-tuning delta can be quantized to 1 bit without performance loss, reducing GPU memory by over 10x.

Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given the higher computational demand of pre-training, it's intuitive to assume that fine-tuning adds less new information to the model, and is thus more compressible. We explore this assumption by decomposing the weights of fine-tuned models into their pre-trained components and an additional delta. We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance. This interesting finding not only highlights the potential redundancy of information added during fine-tuning, but also has significant implications for the multi-tenant serving and multi-tenant storage of fine-tuned models. By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta dramatically reduces GPU memory requirements by more than 10x, which can also be translated to enhanced generation latency in multi-tenant settings. We validate BitDelta through experiments across Llama-2 and Mistral model families, and on models up to 70B parameters, showcasing minimal performance degradation over all tested settings.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes