AI · Feb 28, 2025

Fuzzy Speculative Decoding for a Tunable Accuracy-Runtime Tradeoff

arXiv:2502.20704v4 · 7 citations · h-index: 14 · ACL
Originality: Incremental advance
AI Analysis

This work addresses inference efficiency for users of large language models by enabling a tunable trade-off between accuracy and speed. It is an incremental advance, since it builds on existing speculative decoding (SD) methods.

The paper tackles a limitation of Speculative Decoding (SD): its strict distributional equivalence with the target model caps the achievable inference speed. It introduces Fuzzy Speculative Decoding (FSD), which allows controlled divergence from the target distribution to trade generation quality for speed, running over 5 tokens per second faster than SD at a cost of roughly 2% absolute benchmark accuracy.

Speculative Decoding (SD) enforces strict distributional equivalence to the target model when accepting candidate tokens. While it maintains the target model's generation quality, this strict equivalence limits the speedup achievable by SD and prevents users from trading deviations from the target distribution in exchange for further inference speed gains. To address these limitations, we introduce Fuzzy Speculative Decoding (FSD) - a decoding algorithm that generalizes SD by accepting candidate tokens based on the divergences between the target and draft model distributions. By allowing for controlled divergence from the target model, FSD enables users to flexibly trade generation quality for inference speed. Across several benchmarks, our method is able to achieve significant runtime improvements of over 5 tokens per second faster than SD at only an approximate 2% absolute reduction in benchmark accuracy. In many cases, FSD is even able to match SD benchmark accuracy at over 2 tokens per second faster, demonstrating that distributional equivalence is not necessary to maintain target model performance. Furthermore, FSD can be seamlessly integrated into existing SD extensions; we demonstrate this by applying FSD to EAGLE-2, greatly enhancing this existing extension's efficiency while allowing it to leverage FSD's tunable quality-speed trade-off.
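The acceptance idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The choice of Jensen-Shannon divergence, the `threshold` parameter, the fallback to the standard SD ratio test, and the function names `js_divergence` and `fuzzy_accept` are all assumptions made for illustration. The abstract states only that acceptance is based on divergences between the target and draft distributions.

```python
import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions.

    Assumption: JS is used here only as one plausible divergence measure.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0.
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def fuzzy_accept(token, target_probs, draft_probs, threshold, rng):
    """Decide whether to keep one draft token (hypothetical helper).

    If the two distributions are close (divergence <= threshold), accept
    the draft token outright. Otherwise fall back to the usual speculative
    decoding test: accept with probability min(1, p_target / p_draft).
    Resampling after a rejection is omitted from this sketch.
    """
    target_probs = np.asarray(target_probs, dtype=float)
    draft_probs = np.asarray(draft_probs, dtype=float)
    if js_divergence(target_probs, draft_probs) <= threshold:
        return True
    ratio = target_probs[token] / max(draft_probs[token], 1e-12)
    return bool(rng.random() < min(1.0, ratio))


rng = np.random.default_rng(0)
target = [0.6, 0.3, 0.1]
draft = [0.5, 0.4, 0.1]
accepted = fuzzy_accept(token=1, target_probs=target, draft_probs=draft,
                        threshold=0.05, rng=rng)
```

In this sketch, raising `threshold` accepts more draft tokens without the ratio test, which trades fidelity to the target distribution for speed. A negative `threshold` disables the fuzzy path and leaves only the standard SD acceptance test.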

Code Implementations: 1 repo