DC | LG | Dec 8, 2023

DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs

arXiv:2312.05215v3 | 20 citations | h-index: 8 | EuroSys
Originality: Incremental advance
AI Analysis

The paper addresses inefficient serving when users send sporadic, bursty requests to many different fine-tuned LLMs. It represents an incremental improvement in serving efficiency.

The paper tackles the challenge of serving multiple fine-tuned large language models concurrently. It introduces DeltaZip, a system that compresses model deltas by up to 10x and achieves a 2x to 12x throughput improvement over state-of-the-art systems.

Fine-tuning large language models (LLMs) greatly improves model quality for downstream tasks. However, serving many fine-tuned LLMs concurrently is challenging due to the sporadic, bursty, and varying request patterns of different LLMs. To bridge this gap, we present DeltaZip, an LLM serving system that efficiently serves multiple full-parameter fine-tuned models concurrently by aggressively compressing model deltas by up to 10x while maintaining high model quality. The key insight behind this design is that fine-tuning results in small-magnitude changes to the pre-trained model. By co-designing the serving system with the compression algorithm, DeltaZip achieves 2x to 12x improvement in throughput compared to the state-of-the-art systems.
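The key insight above (fine-tuning changes the pre-trained weights only slightly) can be illustrated with a small sketch. It is not DeltaZip's compression algorithm, which this summary does not describe. It assumes plain symmetric uniform quantization, synthetic weights, and hypothetical helper names (`extract_delta`, `quantize`, `dequantize`), and only shows why a small-magnitude delta tolerates low-bit storage better than the full weights do.

```python
import random

def extract_delta(base, tuned):
    """Delta = fine-tuned weights minus pre-trained (base) weights."""
    return [t - b for b, t in zip(base, tuned)]

def quantize(values, bits=4):
    """Symmetric uniform quantization to signed integer codes (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]

def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# Synthetic stand-ins for real weights (assumption): the fine-tuned weights
# differ from the base weights by a small-magnitude perturbation.
random.seed(0)
base = [random.gauss(0.0, 1.0) for _ in range(1000)]
tuned = [b + random.gauss(0.0, 0.01) for b in base]

# Option A: quantize only the delta, then add it back to the shared base.
delta = extract_delta(base, tuned)
codes, scale = quantize(delta, bits=4)
restored = [b + d for b, d in zip(base, dequantize(codes, scale))]

# Option B: quantize the fine-tuned weights directly at the same bit width.
codes_w, scale_w = quantize(tuned, bits=4)
direct = dequantize(codes_w, scale_w)

err_delta = max_abs_error(tuned, restored)
err_direct = max_abs_error(tuned, direct)
print(f"max error, quantized delta:   {err_delta:.6f}")
print(f"max error, quantized weights: {err_direct:.6f}")
```

Because the quantization step size scales with the largest magnitude in the tensor, the delta's small range yields a much finer grid at the same bit width. In a serving setting, the base model would be stored once and shared, with only the compact deltas kept per fine-tuned variant.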

Code Implementations: 1 repo
