PF AI | Oct 31, 2024

ALISE: Accelerating Large Language Model Serving with Speculative Scheduling

arXiv:2410.23537v1 | 8 citations | h-index: 3 | ICCAD

Originality: Incremental advance

AI Analysis

This work addresses latency and throughput issues in LLM inference serving, which is critical for applications like ChatGPT, but it is an incremental improvement over existing systems.

The paper tackles inefficient scheduling in large language model (LLM) serving systems, where first-come-first-serve (FCFS) policies cause head-of-line blocking and long response times. It proposes ALISE, a framework that combines speculative scheduling with priority-based KV cache management and quantization-based compression. Under the same latency constraint, the reported throughput gains over vLLM are up to 1.8x on the Alpaca dataset and 2.1x on the ShareGPT dataset.

Large Language Models (LLMs) represent a revolutionary advancement in the contemporary landscape of artificial general intelligence (AGI). As exemplified by ChatGPT, LLM-based applications necessitate minimal response latency and maximal throughput for inference serving. However, due to the unpredictability of LLM execution, the first-come-first-serve (FCFS) scheduling policy employed by current LLM serving systems suffers from head-of-line (HoL) blocking issues and long job response times. In this paper, we propose a new efficient LLM inference serving framework, named ALISE. The key design paradigm of ALISE is to leverage a novel speculative scheduler by estimating the execution time for each job and exploiting such prior knowledge to assign appropriate job priority orders, thus minimizing potential queuing delays for heterogeneous workloads. Furthermore, to mitigate the memory overhead of the intermediate key-value (KV) cache, we employ a priority-based adaptive memory management protocol and quantization-based compression techniques. Evaluations demonstrate that in comparison to the state-of-the-art solution vLLM, ALISE improves the throughput of inference serving by up to 1.8x and 2.1x under the same latency constraint on the Alpaca and ShareGPT datasets, respectively.
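The scheduling idea in the abstract can be illustrated with a toy comparison between FCFS and a policy that orders waiting jobs by estimated execution time. This is a minimal sketch, not ALISE's scheduler: it assumes a single non-preemptive server, and the `Job` fields, the `mean_response_time` helper, and the predicted times are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    arrival: float    # time the request enters the queue
    predicted: float  # estimated execution time (hypothetical predictor output)
    actual: float     # true execution time, unknown to the scheduler


def mean_response_time(jobs, key):
    """Serve jobs one at a time on a single server, without preemption.

    Whenever the server is free, the waiting job with the smallest `key`
    value runs next. Response time is completion time minus arrival time.
    """
    pending = sorted(jobs, key=lambda j: j.arrival)
    clock, total = 0.0, 0.0
    while pending:
        ready = [j for j in pending if j.arrival <= clock]
        if not ready:
            # Server is idle: jump ahead to the next arrival.
            clock = pending[0].arrival
            continue
        job = min(ready, key=key)
        pending.remove(job)
        clock += job.actual
        total += clock - job.arrival
    return total / len(jobs)


# One long request arrives just ahead of two short ones (toy numbers).
jobs = [
    Job("long", 0.0, predicted=9.0, actual=10.0),
    Job("short-1", 0.0, predicted=1.5, actual=1.0),
    Job("short-2", 0.0, predicted=0.8, actual=1.0),
]

fcfs = mean_response_time(jobs, key=lambda j: j.arrival)    # arrival order
spec = mean_response_time(jobs, key=lambda j: j.predicted)  # shortest predicted first

print(fcfs, spec)  # 11.0 5.0
```

In this toy workload the long job blocks both short jobs under FCFS, which is the head-of-line blocking the abstract describes. Ordering by predicted time lets the short jobs finish first even though the predictions are imperfect, because only their relative order matters here.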
