DC AI CV LG | Dec 25, 2024

Efficiently Serving Large Multimodal Models Using EPD Disaggregation

Gursimran Singh, Xinglu Wang, Yifan Hu, Timothy Yu, Linzi Xing, Wei Jiang, Zhefeng Wang, Xiaolong Bai, Yi Li, Ying Xiong, Yong Zhang, Zhenan Fan

arXiv:2501.05460v49.222 citations | h-index: 4 | Has Code | ICML

Originality: Incremental advance

AI Analysis

This work targets inefficiencies in serving multimodal AI models for latency-sensitive applications. It offers a new method for a known bottleneck rather than a foundational advance.

The paper tackles the high computational and memory overhead of serving Large Multimodal Models (LMMs) by introducing Encode-Prefill-Decode (EPD) Disaggregation, which places the encoding, prefill, and decode stages on dedicated resources. Reported gains include up to 15x lower peak memory utilization, up to 22x larger batch sizes, and up to a 71% reduction in time to first token.

Large Multimodal Models (LMMs) extend Large Language Models (LLMs) by handling diverse inputs such as images, audio, and video, but at the cost of adding a multimodal encoding stage that increases both computational and memory overhead. This step negatively affects key Service Level Objectives (SLOs), such as time to first token (TTFT) and time per output token (TPOT). We introduce Encode-Prefill-Decode (EPD) Disaggregation, a novel framework that separates the encoding, prefill, and decode stages onto dedicated resources. Unlike current systems, which bundle encoding and prefill together, our approach decouples these steps, unlocking new opportunities and optimizations. These include a mechanism to cache multimedia tokens for efficient transfer, a novel way to parallelize the encoding load within a request, a module for optimal resource allocation for disaggregated serving, and a novel role-switching method to handle changing workload characteristics. Experimental evaluations with popular LMMs show substantial gains in memory efficiency (up to 15x lower peak memory utilization), batch sizes (up to 22x larger), 10x more images per request, and 2.2x larger KV caches. Furthermore, it leads to significant improvements in SLO attainment (up to 90-100% improvement) and TTFT (up to 71% reduction), compared to systems that do not disaggregate. The code is available at https://github.com/vbdi/epdserve.
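The stage separation described in the abstract can be illustrated with a toy, single-process sketch. This is not the authors' implementation and does not use the epdserve API: every name below (`Request`, `encode_image`, `encode_stage`, `prefill_stage`, `decode_stage`, `serve`) is hypothetical, the "encoder" and "model" are string stand-ins, and a thread pool stands in for dedicated encoder workers. It only shows the shape of the idea: encoding is its own stage, the images of one request can be split across encoder workers, and only the resulting multimodal tokens are handed to prefill.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request type: a text prompt plus image identifiers.
    prompt: str
    images: list
    max_new_tokens: int = 3


def encode_image(image):
    # Stand-in for a vision encoder: each image yields two placeholder tokens.
    return [f"<{image}:{i}>" for i in range(2)]


def encode_stage(req, num_workers=2):
    # Encode stage: split one request's images across encoder workers.
    # pool.map returns results in input order, so token order is preserved.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        per_image = list(pool.map(encode_image, req.images))
    return [tok for toks in per_image for tok in toks]


def prefill_stage(req, mm_tokens):
    # Prefill stage: consumes only the multimodal tokens and the prompt.
    # The returned list is a stand-in for a KV cache.
    return mm_tokens + req.prompt.split()


def decode_stage(kv_cache, max_new_tokens):
    # Decode stage: emits one placeholder token per step and extends the cache.
    output = []
    for step in range(max_new_tokens):
        token = f"tok{step}"
        kv_cache.append(token)
        output.append(token)
    return output


def serve(req):
    # Chain the three stages; in a disaggregated deployment each stage would
    # run on its own resources, with tokens and KV cache transferred between them.
    mm_tokens = encode_stage(req)
    kv_cache = prefill_stage(req, mm_tokens)
    return decode_stage(kv_cache, req.max_new_tokens)


if __name__ == "__main__":
    print(serve(Request(prompt="describe these", images=["a", "b", "c"])))
```

In this sketch the handoff between stages is a plain function call. A real disaggregated system would replace those calls with transfers between separate instances, which is where the token caching and resource allocation mentioned in the abstract come in.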
