CL | AI | Oct 17, 2024

Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement

arXiv:2410.13344v11 citations | h-index: 7
Originality Highly original
AI Analysis

This paper addresses inference efficiency for users of large language models, offering a novel adaptive method that improves upon existing parallel decoding techniques.

The paper tackles the bottleneck in LLM inference speed by proposing Cerberus, an adaptive parallel decoding framework that dynamically chooses decoding approaches, achieving up to 2.12x speedup over auto-regressive decoding and outperforming Medusa with 10-30% higher acceleration and better generation quality.

Large language models (LLMs) often face a bottleneck in inference speed due to their reliance on auto-regressive decoding. Recently, parallel decoding has shown significant promise in enhancing inference efficiency. However, we have identified two key issues with existing parallel decoding frameworks: (1) decoding heads fail to balance prediction accuracy and the parallelism of execution, and (2) parallel decoding is not a universal solution, as it can bring unnecessary overhead at some challenging decoding steps. To address these issues, we propose Cerberus, an adaptive parallel decoding framework that introduces a gating mechanism to enable LLMs to adaptively choose the appropriate decoding approach at each decoding step, along with a new paradigm of decoding heads that introduces sequential knowledge while maintaining execution parallelism. Experimental results demonstrate that Cerberus achieves up to a 2.12x speedup compared to auto-regressive decoding and outperforms one of the leading parallel decoding frameworks, Medusa, with a 10% to 30% increase in acceleration and superior generation quality.
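To make the idea of choosing a decoding approach per step concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: the function names (`adaptive_decode`, `toy_base_next`, `toy_heads`, `toy_gate`), the fixed threshold, and the hand-written gate are all hypothetical stand-ins, and the abstract does not say how the actual gate or decoding heads are built or trained. The sketch only illustrates the control flow: a gate score decides whether a step runs plain auto-regressive decoding or also accepts extra tokens guessed by decoding heads, keeping only the guesses the base model agrees with.

```python
from typing import Callable, List, Sequence, Tuple

Token = int


def adaptive_decode(
    base_next: Callable[[Sequence[Token]], Token],
    draft_heads: Callable[[Sequence[Token], int], List[Token]],
    gate: Callable[[Sequence[Token]], float],
    prompt: Sequence[Token],
    max_new_tokens: int,
    num_heads: int = 3,
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> Tuple[List[Token], int]:
    """Return (generated tokens, number of decoding steps)."""
    out = list(prompt)
    target_len = len(prompt) + max_new_tokens
    steps = 0
    while len(out) < target_len:
        steps += 1
        # The gate scores the current prefix before the step is taken.
        use_parallel = gate(out) >= threshold
        # Every step yields at least the base model's own next token.
        out.append(base_next(out))
        if not use_parallel:
            continue  # plain auto-regressive step
        # Parallel step: accept head guesses while the base model agrees.
        # A real system would verify all guesses in one batched forward
        # pass; here verification is simulated with repeated calls.
        for guess in draft_heads(out, num_heads):
            if len(out) >= target_len or guess != base_next(out):
                break
            out.append(guess)
    return out[len(prompt):], steps


# --- Toy stand-ins (assumptions for illustration only) ---

def toy_base_next(prefix: Sequence[Token]) -> Token:
    # "Base model": counts upward modulo 10.
    return (prefix[-1] + 1) % 10


def toy_heads(prefix: Sequence[Token], n: int) -> List[Token]:
    # "Decoding heads": keep counting but never wrap 9 -> 0,
    # so their guesses are wrong near the wrap-around.
    return [prefix[-1] + i for i in range(1, n + 1)]


def toy_gate(prefix: Sequence[Token]) -> float:
    # "Gate": flags steps near the wrap-around as hard (score 0.0).
    return 0.0 if prefix[-1] >= 8 else 1.0


if __name__ == "__main__":
    tokens, steps = adaptive_decode(
        toy_base_next, toy_heads, toy_gate, prompt=[0], max_new_tokens=12
    )
    print(tokens, steps)
```

Because every accepted guess is checked against the base model, the toy output matches pure auto-regressive decoding token for token; the only thing that changes is how many steps are needed. Skipping the heads on steps the gate flags as hard avoids the drafting and verification work that would be wasted there, which is the overhead the abstract refers to.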

