AI, AR, DC · Jan 14, 2025

PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving

Ahmet Caner Yüzügüler, Jiawei Zhuang, Lukas Cavigelli

arXiv:2501.08192v213.611 citations · h-index: 24 · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses memory bottlenecks and communication overheads to improve the performance and scalability of LLM inference systems, though it appears incremental because it builds on prior methods that overlap communication with compute.

The paper tackles the communication overhead in distributed LLM serving by proposing PRESERVE, a framework that prefetches model weights and KV-cache to on-chip cache during communication, achieving up to 1.6x end-to-end speedup on state-of-the-art LLMs.

Large language models (LLMs) are typically served from clusters of GPUs/NPUs that consist of a large number of devices. Unfortunately, communication between these devices incurs significant overhead, increasing inference latency and cost while limiting scalability. Prior work addressed this issue by overlapping communication with compute, but has severe limitations due to the data dependencies between these operations. In this paper, we propose PRESERVE, a novel framework that prefetches model weights and KV-cache from off-chip high-bandwidth memory (HBM) to the on-chip cache of AI accelerators during the communication operations, which offers various advantages and performance improvements compared to prior methods. Through extensive experiments conducted on commercial AI accelerators, we demonstrate up to 1.6x end-to-end speedup on state-of-the-art, open-source LLMs. Additionally, we perform a design space exploration that identifies the optimal hardware configuration for the proposed method, showing a further 1.25x improvement in performance per cost by selecting the optimal L2 cache size. Our results show that PRESERVE has the potential to mitigate memory bottlenecks and communication overheads, offering a solution to improve the performance and scalability of LLM inference systems.
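
The idea of hiding memory reads behind communication can be illustrated with a toy latency model. This is a minimal sketch and not the paper's model or implementation: the function names, the `cache_fraction` parameter, and the assumption that a memory-bound operator simply waits for the preceding collective to finish are all hypothetical simplifications introduced here.

```python
def baseline_latency(t_comm: float, t_mem: float) -> float:
    """Latency of one communication step followed by one memory-bound operator.

    Assumption: without prefetching, the HBM read (t_mem) starts only after
    the collective communication (t_comm) has finished.
    """
    return t_comm + t_mem


def prefetch_latency(t_comm: float, t_mem: float, cache_fraction: float) -> float:
    """Latency when part of the next operator's data is prefetched during communication.

    cache_fraction is the (hypothetical) share of the operator's weights and
    KV-cache that fits in on-chip cache. Only that share can be prefetched,
    and only as much of it as the communication window has time for.
    """
    if not 0.0 <= cache_fraction <= 1.0:
        raise ValueError("cache_fraction must be in [0, 1]")
    # HBM read time that is hidden behind the communication window.
    hidden = min(cache_fraction * t_mem, t_comm)
    return t_comm + t_mem - hidden


def speedup(t_comm: float, t_mem: float, cache_fraction: float) -> float:
    """Ratio of baseline latency to prefetch latency under this toy model."""
    return baseline_latency(t_comm, t_mem) / prefetch_latency(t_comm, t_mem, cache_fraction)


if __name__ == "__main__":
    # Arbitrary time units: communication takes 2, the HBM read takes 4.
    for fraction in (0.0, 0.25, 0.5, 1.0):
        print(fraction, speedup(2.0, 4.0, fraction))
```

In this sketch the benefit saturates once the prefetchable data takes as long to read as the communication itself, which gives one intuition for why a cache larger than some size stops paying for itself in a performance-per-cost comparison.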

