LG · Aug 6, 2025

CARD: A Cache-Assisted Parallel Speculative Decoding Framework via Query-and-Correct Paradigm for Accelerating LLM Inference

arXiv:2508.04462v24.1 h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses a performance bottleneck in LLM inference for applications that need faster generation, though the contribution appears incremental because it builds on existing speculative decoding.

The paper tackles the inefficiency of existing speculative decoding methods for LLM inference by proposing CARD, a cache-assisted parallel framework using a query-and-correct paradigm, which achieves up to 4.83x acceleration over vanilla autoregressive decoding without fine-tuning.

Speculative decoding (SD), where a draft model provides multiple candidate tokens for the target model to verify in parallel, has demonstrated significant potential for accelerating LLM inference. Yet, existing SD approaches adhere to a strict draft-then-verify paradigm, enforcing a sequential process that hampers performance and constrains the draft model's capacity. Moreover, rejecting a token in the candidate sequence invalidates all subsequent tokens, leading to wasted computation during drafting. To overcome these limitations, we propose a cache-assisted parallel speculative decoding framework called CARD, which employs a novel query-and-correct paradigm. Our approach decouples drafting from verification: the draft model populates a shared cache with candidate tokens, while the target model concurrently refines the draft's trajectory. This enables inference at near-draft-speed, effectively leveraging the draft model's efficiency without additional fine-tuning. Experimental results show that CARD significantly outperforms existing state-of-the-art methods, achieving up to a 4.83x acceleration over vanilla autoregressive decoding, with no fine-tuning required for either model.
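
To make the baseline concrete, the following is a minimal sketch of the conventional draft-then-verify loop that the abstract criticizes, not of CARD itself. The two "models" are toy deterministic functions invented for illustration (`target_next`, `draft_next`), decoding is greedy, and verification is written as a Python loop even though a real target model would score all candidates in one parallel pass. It shows how a single rejected token discards every candidate drafted after it.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function from a token context to the next token.
Model = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Toy target model (hypothetical): counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[int]) -> int:
    """Toy draft model (hypothetical): agrees with the target except after token 3."""
    return 5 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


def autoregressive(model: Model, prompt: List[int], n: int) -> List[int]:
    """Vanilla decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy draft-then-verify decoding.

    Returns the generated sequence and the number of drafted tokens that
    were thrown away (the rejected token plus everything drafted after it).
    """
    out = list(prompt)
    goal = len(prompt) + n
    wasted = 0
    while len(out) < goal:
        # Drafting phase: the draft model proposes k candidate tokens.
        cand: List[int] = []
        for _ in range(k):
            cand.append(draft(out + cand))

        # Verification phase: accept the longest prefix the target agrees with.
        accepted = 0
        for i in range(k):
            if target(out + cand[:i]) == cand[i]:
                accepted += 1
            else:
                break  # first mismatch invalidates all later candidates

        wasted += k - accepted
        out.extend(cand[:accepted])
        # The target supplies the corrected token (or a bonus token if all matched).
        out.append(target(out))
    return out[:goal], wasted


if __name__ == "__main__":
    tokens, wasted = speculative_decode(target_next, draft_next, [2], n=4, k=4)
    print(tokens, "wasted drafted tokens:", wasted)
```

Because verification falls back to the target's own token at the first mismatch, the output matches plain greedy decoding with the target model; the cost of a mismatch shows up only in the `wasted` count and in the drafting round that has to restart.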
