DC · CL · LG · Feb 29, 2024

FlexLLM: Token-Level Co-Serving of LLM Inference and Finetuning with SLO Guarantees

arXiv:2402.18789v3 · 14 citations · h-index: 11
Originality: Highly original
AI Analysis

FlexLLM addresses GPU underutilization and the resulting cost inefficiency for organizations that deploy LLMs. It proposes a new co-serving approach rather than an incremental improvement to existing serving stacks.

The paper tackles the problem of inefficient resource usage in LLM serving by introducing FlexLLM, a system that co-serves inference and finetuning on shared GPUs, achieving up to 80% GPU memory savings and improving finetuning throughput by 1.9-6.8× while maintaining inference latency SLOs.

Finetuning large language models (LLMs) is essential for task adaptation, yet today's serving stacks isolate inference and finetuning on separate GPU clusters -- wasting resources and under-utilizing hardware. We introduce FlexLLM, the first system to co-serve LLM inference and PEFT-based finetuning on shared GPUs by fusing computation at the token level. FlexLLM's static compilation optimizations -- dependent parallelization and graph pruning -- significantly shrink activation memory, yielding end-to-end GPU memory savings of up to 80%. At runtime, a novel token-level finetuning mechanism paired with a hybrid token scheduler dynamically interleaves inference and training tokens within each co-serving iteration, meeting strict latency SLOs while maximizing utilization. In end-to-end benchmarks on LLaMA-3.1-8B, Qwen-2.5-14B, and Qwen-2.5-32B, FlexLLM maintains inference SLO compliance at up to 20 req/s, and improves finetuning throughput by $1.9-4.8\times$ under heavy inference workloads and $2.5-6.8\times$ under light loads, preserving over 76% of peak finetuning progress even at peak demand. FlexLLM is publicly available at https://flexllm.github.io.
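
To make the idea of interleaving inference and finetuning tokens within one iteration concrete, here is a minimal sketch. It is not FlexLLM's actual scheduler. It assumes a fixed per-iteration token budget (standing in for whatever limit keeps an iteration within the latency SLO) and a simple inference-first policy that fills leftover capacity with finetuning tokens. The names `IterationPlan`, `plan_iteration`, `simulate`, and `token_budget` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IterationPlan:
    """Token mix for one co-serving iteration (hypothetical structure)."""
    inference_tokens: int
    finetuning_tokens: int


def plan_iteration(pending_inference_tokens: int,
                   pending_finetuning_tokens: int,
                   token_budget: int) -> IterationPlan:
    """Serve inference tokens first, then fill the remaining slack with finetuning tokens.

    Assumption: `token_budget` is the largest number of tokens one iteration
    can process without violating the inference latency SLO.
    """
    if min(pending_inference_tokens, pending_finetuning_tokens, token_budget) < 0:
        raise ValueError("token counts must be non-negative")
    inference = min(pending_inference_tokens, token_budget)
    slack = token_budget - inference
    finetuning = min(pending_finetuning_tokens, slack)
    return IterationPlan(inference, finetuning)


def simulate(inference_arrivals: List[int],
             finetuning_backlog: int,
             token_budget: int) -> List[IterationPlan]:
    """Run the policy over a sequence of iterations with varying inference load."""
    plans = []
    carried_inference = 0  # inference tokens that did not fit in earlier iterations
    for arriving in inference_arrivals:
        carried_inference += arriving
        plan = plan_iteration(carried_inference, finetuning_backlog, token_budget)
        carried_inference -= plan.inference_tokens
        finetuning_backlog -= plan.finetuning_tokens
        plans.append(plan)
    return plans


if __name__ == "__main__":
    for step, plan in enumerate(simulate([100, 300, 0], 500, 256)):
        print(step, plan)
```

In this toy policy, finetuning makes progress only when inference leaves slack, so it stalls during a burst and resumes when load drops. The abstract describes the real system as a hybrid token scheduler operating under SLO constraints, so its policy is presumably more involved than this.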

Code Implementations: 1 repo