CL AI Dec 15, 2025

Efficient Adaptive Rejection Sampling for Accelerating Speculative Decoding in Large Language Models

Chendong Sun, Ali Mao, Lei Xu, Mingmin Chen

arXiv:2512.13194v3, 1 citation

Originality: Incremental advance

AI Analysis

This work addresses efficiency bottlenecks in LLM inference for users who need faster generation, though it is an incremental improvement over existing speculative decoding methods.

The paper tackles the problem of random rejections in speculative decoding for large language models by introducing Efficient Adaptive Rejection Sampling (EARS), which dynamically adjusts the acceptance threshold based on the target model's uncertainty. It reports up to an 18.12% increase in throughput with a 0.84% accuracy drop on GSM8K.

Speculative Decoding is a prominent technique for accelerating the autoregressive inference of large language models (LLMs) by employing a fast draft model to propose candidate token sequences and a large target model to verify them in parallel. However, its core component -- the rejection sampling mechanism -- relies on a fixed, context-independent random threshold. This leads to a significant "random rejection" problem in high-uncertainty generation scenarios, where plausible candidate tokens are frequently rejected due to random chance, undermining inference efficiency. This paper introduces Efficient Adaptive Rejection Sampling (EARS), a novel method that dynamically adjusts the acceptance threshold by incorporating the target model's own predictive uncertainty, measured as 1 - max(P_target). By introducing a tolerance term proportional to this uncertainty, EARS intelligently relaxes the acceptance criterion when the model is uncertain, effectively reducing random rejections while maintaining strict standards when the model is confident. Experiments on creative writing and open-domain QA tasks demonstrate that EARS significantly enhances the efficiency of speculative decoding, achieving up to an 18.12% increase in throughput with a negligible 0.84% accuracy drop on the GSM8K benchmark. The method requires no modifications to model architectures and can be seamlessly integrated into existing speculative decoding frameworks.
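
The acceptance rule described in the abstract can be illustrated with a minimal Python sketch. The abstract only states that a tolerance term proportional to 1 - max(P_target) relaxes the acceptance criterion; the additive form below, the function names, and the scaling factor `beta` are assumptions made for illustration and may differ from the paper's exact formulation.

```python
def accept_standard(p_target, p_draft, token, u):
    """Standard speculative-decoding test: accept when u <= min(1, p_target/p_draft).

    p_target, p_draft: probability lists over the vocabulary at this position.
    token: index of the drafted token (assumed sampled from p_draft, so p_draft[token] > 0).
    u: a uniform random draw in [0, 1).
    """
    ratio = p_target[token] / p_draft[token]
    return u <= min(1.0, ratio)


def accept_adaptive(p_target, p_draft, token, u, beta=0.1):
    """Hypothetical EARS-style test: relax the threshold by beta * uncertainty.

    uncertainty = 1 - max(p_target), as described in the abstract.
    beta is an assumed hyperparameter, not a value taken from the paper.
    """
    uncertainty = 1.0 - max(p_target)
    ratio = p_target[token] / p_draft[token]
    return u <= min(1.0, ratio + beta * uncertainty)


# Uncertain target distribution: the tolerance widens the acceptance window.
p_target = [0.40, 0.35, 0.25]
p_draft = [0.30, 0.50, 0.20]
u = 0.73  # fixed draw so the comparison is reproducible
print(accept_standard(p_target, p_draft, 1, u))
print(accept_adaptive(p_target, p_draft, 1, u))

# Confident target distribution: the tolerance is tiny, so the test stays strict.
p_confident = [0.98, 0.01, 0.01]
print(accept_adaptive(p_confident, p_draft, 1, 0.5))
```

In the uncertain case the plain ratio is 0.7, so a draw of 0.73 rejects the token under the standard rule, while the assumed tolerance of 0.1 × 0.6 lifts the threshold enough to accept it. Because the relaxed rule accepts tokens the exact test would reject, the output distribution is no longer guaranteed to match the target model's, which is consistent with the small accuracy drop the abstract reports.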
