PF · AI · CL · Aug 11, 2025

A Data-driven ML Approach for Maximizing Performance in LLM-Adapter Serving

arXiv:2508.08343v3 · 1 citation · h-index: 5 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in LLM-adapter serving systems to improve GPU throughput, representing an incremental advance in optimization techniques for this domain.

This study tackles the request starvation that occurs when serving many LLM-adapters exceeds GPU memory limits. It develops a data-driven ML approach that selects the joint configuration of concurrent and parallel adapters, predicting the optimal numbers of both with an error of at most 7.2% under heterogeneous, real-world workloads.

With the rapid adoption of Large Language Models (LLMs), LLM-adapters have become increasingly common, providing lightweight specialization of large-scale models. Serving hundreds or thousands of these adapters on a single GPU allows request aggregation, increasing throughput, but may also cause request starvation if GPU memory limits are exceeded. To address this issue, this study focuses on determining the joint configuration of concurrent and parallel adapters that maximizes GPU throughput without inducing starvation, given heterogeneous adapter and traffic properties. We propose a data-driven ML approach leveraging interpretable models to tackle this caching problem and introduce the first Digital Twin capable of reproducing an LLM-adapter serving system, enabling efficient training data generation. Experiments with the vLLM framework and LoRA adapters show that the Digital Twin reproduces throughput within 5.1% of real results, while the ML approach predicts optimal numbers of concurrent and parallel adapters with an error of at most 7.2% under heterogeneous, real-world workloads. The code is publicly available at https://github.com/FerranAgulloLopez/GPULLMAdapterOptimization.
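The selection problem described in the abstract can be illustrated with a minimal sketch: among candidate (concurrent, parallel) adapter settings, keep those predicted not to starve and pick the one with the highest predicted throughput. This is not the authors' implementation. The function `best_config`, the two toy predictors, their formulas, and the candidate grid are all hypothetical stand-ins for the learned models and workload properties the paper refers to.

```python
from itertools import product
from typing import Callable, Iterable, Optional, Tuple

# (number of concurrent adapters, number of parallel adapters)
Config = Tuple[int, int]


def best_config(
    candidates: Iterable[Config],
    predict_throughput: Callable[[int, int], float],
    predict_starvation: Callable[[int, int], bool],
) -> Optional[Config]:
    """Return the non-starving candidate with the highest predicted throughput.

    Returns None when every candidate is predicted to starve.
    """
    feasible = [c for c in candidates if not predict_starvation(*c)]
    if not feasible:
        return None
    return max(feasible, key=lambda c: predict_throughput(*c))


# Toy predictors: invented formulas for illustration only, not from the paper.
def toy_throughput(concurrent: int, parallel: int) -> float:
    # Assumption: throughput grows with concurrent adapters until it is
    # capped by how many adapters can run in parallel.
    return min(concurrent, 4 * parallel) * 10.0


def toy_starvation(concurrent: int, parallel: int) -> bool:
    # Assumption: a fixed budget of 40 memory units, with made-up costs
    # per parallel and per concurrent adapter.
    return 2 * parallel + 0.1 * concurrent > 40


# Hypothetical candidate grid of settings to evaluate.
grid = list(product(range(8, 129, 8), range(1, 33)))
print(best_config(grid, toy_throughput, toy_starvation))
```

In practice the two predictors would be replaced by models trained on measured or simulated serving data, which is the role the abstract assigns to the Digital Twin.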

