CL · AI · Feb 19, 2024

Plato: Plan to Efficiently Decode for Large Language Model Inference

Shuowei Jin, Xueshen Liu, Yongji Wu, Haizhong Zheng, Qingzhao Zhang, Atul Prakash, Matthew Lentz, Danyang Zhuo, Feng Qian, Z. Morley Mao

arXiv:2402.12280v22.74 citations · h-index: 19

Originality: Highly original

AI Analysis

The paper targets efficiency bottlenecks in LLM inference, which matters to users who need faster, high-quality responses. It offers a novel method for a known bottleneck rather than an incremental improvement.

The paper tackles the problem of high computational and memory overhead in large language model inference by proposing Plato, a semantic-aware parallel decoding method that improves throughput by 68% over autoregressive decoding while achieving a 40% net win rate in answer quality.

Large language models (LLMs) have achieved remarkable success in natural language tasks, but their inference incurs substantial computational and memory overhead. To improve efficiency, parallel decoding methods like Skeleton-of-Thought (SoT) decompose prompts into sub-problems for concurrent processing. However, these methods significantly compromise answer quality by treating semantically linked sub-problems as independent. We propose Plato, a novel approach that co-designs algorithms and systems for semantic-aware parallel decoding. Plato leverages LLMs to organize sub-problems into a dependency graph based on logical and causal relationships, enabling concurrent decoding of non-dependent nodes while preserving answer coherence and quality. To further enhance efficiency, Plato pipelines planning and node decoding stages, implements a global context cache, and carefully structures node inference prompts to maximize key-value cache reuse and minimize overhead. Our evaluations show that Plato improves throughput by 68% over autoregressive decoding while achieving a 40% net win rate in answer quality. Compared to SoT, Plato demonstrates a remarkable 90% quality net-win rate. Ablation studies reveal that our pipeline design improves speedup by 29%, while our KV cache reuse optimization reduces overhead by 75%.
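
The dependency-graph idea in the abstract can be illustrated with a small scheduling sketch. This is not Plato's implementation: `decode_node`, `decode_graph`, and the example graph are hypothetical names chosen for illustration, the LLM call is replaced by a string-building stub, and the pipelining and KV cache reuse described above are not modeled. The sketch only shows how nodes with no unmet dependencies might be decoded concurrently while dependent nodes wait for the answers they build on.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def decode_node(node, question, dep_answers):
    """Hypothetical stand-in for an LLM call that answers one sub-problem.

    A real system would build a prompt from the question and the answers
    of the nodes this one depends on; here we only build a string.
    """
    context = "; ".join(f"{d}={dep_answers[d]}" for d in sorted(dep_answers))
    return f"answer({node}|{context})"


def decode_graph(question, deps, max_workers=4):
    """Decode sub-problems concurrently while respecting dependencies.

    `deps` maps each node to the nodes it depends on (assumed acyclic;
    TopologicalSorter.prepare() raises CycleError otherwise).
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    answers = {}
    in_flight = {}  # future -> node name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every node whose dependencies have all been answered.
            for node in sorter.get_ready():
                dep_answers = {d: answers[d] for d in deps.get(node, ())}
                future = pool.submit(decode_node, node, question, dep_answers)
                in_flight[future] = node
            # Block until at least one node finishes, then unlock its dependents.
            done, _ = wait(list(in_flight), return_when=FIRST_COMPLETED)
            for future in done:
                node = in_flight.pop(future)
                answers[node] = future.result()
                sorter.done(node)
    return answers


if __name__ == "__main__":
    # Hypothetical plan: "ingredients" and "tools" are independent of each
    # other, so they can be decoded in parallel once "overview" is done.
    example_deps = {
        "overview": [],
        "ingredients": ["overview"],
        "tools": ["overview"],
        "steps": ["ingredients", "tools"],
    }
    for name, text in decode_graph("How do I bake bread?", example_deps).items():
        print(name, "->", text)
```

In this sketch a thread pool is used because a remote LLM call would be I/O-bound; a batched local inference engine would likely schedule ready nodes differently.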
