PF · AI · Nov 12, 2024

Faster LLM Inference using DBMS-Inspired Preemption and Cache Replacement Policies

arXiv:2411.07447v4 · 4 citations · h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses performance bottlenecks for users deploying LLMs in databases and other resource-intensive applications, though it is incremental as it builds on existing database methods.

The paper tackles slow LLM inference under concurrent requests. Its performance analysis finds that inference systems lack a resource cost model and an optimization strategy for scheduling requests and managing the GPU-resident cache. The authors adapt database techniques to build cost models and a new cache replacement policy, which they report can substantially reduce GPU costs.

LLMs are increasingly used worldwide, from daily tasks to agentic systems and data analytics, and they require significant GPU resources. LLM inference systems, however, are slow compared to database systems, and inference performance and mechanisms have often been regarded as a black box, which limits the use of LLMs inside databases and other performance-critical applications. This paper first analyzes LLM inference performance and focuses on a data management issue inside LLM inference. We find that, when executing multiple concurrent inference requests, inference systems lack an adequate resource cost model and optimization strategy for scheduling requests whose intermediate results reside in a cache in GPU memory. We adapt classic database techniques by building cost models for concurrent inference requests and a new cache replacement policy tailored for LLM inference, which can substantially save GPU costs.
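To make the idea of a cost-aware cache replacement policy concrete, here is a minimal Python sketch. It is not the paper's policy or cost model: the `CachedRequest` type, the `recompute_cost` formula (tokens squared), and the scoring rule in `choose_victims` are all hypothetical, chosen only to show how an eviction decision can weigh the cost of rebuilding a cached result against the GPU memory it frees, instead of relying on recency alone.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CachedRequest:
    """Cached intermediate results of one paused request (hypothetical model)."""
    request_id: str
    blocks: int      # GPU memory blocks held by this request's cache
    tokens: int      # number of tokens whose intermediate results are cached
    last_used: int   # logical timestamp of the most recent access


def recompute_cost(req: CachedRequest) -> float:
    """Assumed cost of rebuilding the cache after eviction.

    This toy model uses tokens**2 as a stand-in; a real system would
    calibrate the cost from measurements.
    """
    return float(req.tokens ** 2)


def choose_victims(cache: List[CachedRequest], blocks_needed: int, now: int) -> List[str]:
    """Greedily pick requests to evict until enough blocks are freed.

    Entries with a low recompute cost per freed block are evicted first;
    older entries get a lower score, so age acts as a secondary factor.
    """
    def score(req: CachedRequest) -> float:
        age = now - req.last_used + 1
        return recompute_cost(req) / (req.blocks * age)

    victims: List[str] = []
    freed = 0
    for req in sorted(cache, key=score):
        if freed >= blocks_needed:
            break
        victims.append(req.request_id)
        freed += req.blocks
    return victims


cache = [
    CachedRequest("A", blocks=4, tokens=64, last_used=10),
    CachedRequest("B", blocks=4, tokens=512, last_used=1),
    CachedRequest("C", blocks=2, tokens=32, last_used=9),
]
print(choose_victims(cache, blocks_needed=4, now=10))
```

In this toy setup a pure least-recently-used policy would evict request B first because it is the oldest, whereas the cost-aware score keeps B (the most expensive to rebuild) and evicts the cheaper entries C and A.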

