CL | AI | Feb 19, 2024

Plato: Plan to Efficiently Decode for Large Language Model Inference

arXiv:2402.12280v2 | 4 citations | h-index: 19
Originality: Highly original
AI Analysis

Plato targets efficiency bottlenecks in LLM inference for users who need faster, high-quality responses. It is a novel method for a known bottleneck rather than an incremental improvement.

The paper tackles the high computational and memory overhead of large language model inference by proposing Plato, a semantic-aware parallel decoding method that improves throughput by 68% over autoregressive decoding while achieving a 40% net win rate in answer quality.

Large language models (LLMs) have achieved remarkable success in natural language tasks, but their inference incurs substantial computational and memory overhead. To improve efficiency, parallel decoding methods like Skeleton-of-Thought (SoT) decompose prompts into sub-problems for concurrent processing. However, these methods significantly compromise answer quality by treating semantically linked sub-problems as independent. We propose Plato, a novel approach that co-designs algorithms and systems for semantic-aware parallel decoding. Plato leverages LLMs to organize sub-problems into a dependency graph based on logical and causal relationships, enabling concurrent decoding of non-dependent nodes while preserving answer coherence and quality. To further enhance efficiency, Plato pipelines planning and node decoding stages, implements a global context cache, and carefully structures node inference prompts to maximize key-value cache reuse and minimize overhead. Our evaluations show that Plato improves throughput by 68% over autoregressive decoding while achieving a 40% net win rate in answer quality. Compared to SoT, Plato demonstrates a remarkable 90% quality net-win rate. Ablation studies reveal that our pipeline design improves speedup by 29%, while our KV cache reuse optimization reduces overhead by 75%.
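The dependency-graph idea in the abstract can be illustrated with a small scheduling sketch. This is not Plato's implementation. It assumes the planner has already produced a graph mapping each sub-problem to the sub-problems it depends on, and it decodes every currently unblocked node concurrently, passing each node the answers of its dependencies. The names `decode_node` and `decode_graph`, the example graph, and the placeholder "answers" are all hypothetical. The real LLM call, the pipelining of planning with decoding, and KV cache reuse are not modelled.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def decode_node(node, context):
    """Hypothetical stand-in for an LLM call that answers one sub-problem.

    `context` maps each dependency name to its already-decoded answer.
    """
    return f"answer({node}; uses {sorted(context)})"


def decode_graph(deps, max_workers=4):
    """Decode sub-problems in dependency order, running independent ones together.

    `deps` maps each node to the nodes it depends on (an assumed planner output).
    Returns the answers and the list of concurrently decoded "waves".
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    answers, waves = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Every node whose dependencies are all done is ready now.
            ready = sorted(sorter.get_ready())
            waves.append(ready)
            contexts = [{d: answers[d] for d in deps.get(n, ())} for n in ready]
            # Nodes in the same wave do not depend on each other,
            # so they can be decoded at the same time.
            for node, answer in zip(ready, pool.map(decode_node, ready, contexts)):
                answers[node] = answer
                sorter.done(node)
    return answers, waves


# Hypothetical plan: two middle sections both need the intro,
# and the conclusion needs both middle sections.
plan = {
    "intro": [],
    "causes": ["intro"],
    "effects": ["intro"],
    "conclusion": ["causes", "effects"],
}
answers, waves = decode_graph(plan)
print(waves)  # [['intro'], ['causes', 'effects'], ['conclusion']]
```

A wave-based scheduler like this waits for a whole wave before starting the next one. A finer-grained design could start each node as soon as its own dependencies finish, which is closer in spirit to overlapping stages as the abstract describes.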

