AI CL | Apr 16

Acceptance Dynamics Across Cognitive Domains in Speculative Decoding

arXiv:2604.14682

AI Analysis

For researchers and engineers optimizing LLM inference, this paper offers domain-specific guidance on setting speculation budgets and selecting draft models, though the findings are incremental.

This paper empirically studies acceptance dynamics in tree-based speculative decoding across four NLP domains (code, math, logic, chat). It finds that task type predicts acceptance better than tree depth, and that only chat consistently achieves an expected accepted length above 1.0 token per step. The entropy-acceptance correlation is weakly negative in every domain (rho between -0.20 and -0.15). Chat shows the highest entropy yet the highest acceptance rate, which the authors attribute to the lexical predictability of RLHF-aligned register.

Speculative decoding accelerates large language model (LLM) inference. It uses a small draft model to propose a tree of future tokens. A larger target model then verifies these tokens in a single batched forward pass. Despite the growing body of work on speculative methods, the degree to which the cognitive characteristics of a task affect acceptance probability remains largely unexplored. We present an empirical study of tree-based speculative decoding acceptance dynamics. Our study spans four well-established NLP benchmark domains: code generation, mathematical reasoning, logical reasoning, and open-ended chat. For this, we use TinyLlama-1.1B as the draft model against Llama-2-7B-Chat-GPTQ as the target. Over 99,768 speculative nodes collected from 200 prompts, we derive per-domain acceptance rates, expected accepted lengths, depth-acceptance profiles, and entropy-acceptance correlations. We find that task type is a stronger predictor of acceptance than tree depth. Furthermore, only the chat domain consistently yields an expected accepted length exceeding 1.0 token per step. We also show that the entropy-acceptance correlation is consistently negative but weak across all domains (rho in [-0.20, -0.15]). Counterintuitively, chat produces the highest entropy yet the highest acceptance rate. We attribute this divergence to the lexical predictability of RLHF-aligned register. These findings have direct implications for domain-aware speculation budgets and draft-model selection strategies.

Index Terms: speculative decoding, large language model inference, tree attention, draft model, acceptance probability, LLM efficiency
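The quantities named in the abstract (per-domain acceptance rate, expected accepted length, and entropy-acceptance correlation) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the `(domain, entropy, accepted)` record layout, and the toy records are all hypothetical. The expected-length formula assumes a single draft chain whose tokens are accepted independently at each depth, which may differ from the paper's tree-based definition. Spearman's rank correlation is one plausible reading of "rho"; the abstract does not say which coefficient is used.

```python
import math
from collections import defaultdict


def shannon_entropy(probs):
    """Entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def acceptance_rate_by_domain(nodes):
    """Fraction of speculative nodes accepted, grouped by domain.

    `nodes` is an iterable of (domain, entropy, accepted) tuples;
    this record layout is an assumption made for illustration.
    """
    totals = defaultdict(int)
    accepted = defaultdict(int)
    for domain, _entropy, was_accepted in nodes:
        totals[domain] += 1
        accepted[domain] += int(was_accepted)
    return {d: accepted[d] / totals[d] for d in totals}


def expected_accepted_length(depth_accept_probs):
    """Expected accepted tokens per step for a single draft chain.

    Assumes depth k is accepted with conditional probability
    depth_accept_probs[k-1] given that depth k-1 was accepted, so
    E[L] = sum over d of the product p_1 * ... * p_d.
    """
    expected, survive = 0.0, 1.0
    for p in depth_accept_probs:
        survive *= p          # probability the chain is still alive at this depth
        expected += survive
    return expected


def _average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _average_ranks(xs), _average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


# Toy records for illustration only; these are NOT data from the paper.
toy_nodes = [
    ("chat", 2.1, True),
    ("chat", 2.6, False),
    ("code", 0.9, True),
    ("code", 1.8, False),
]
rates = acceptance_rate_by_domain(toy_nodes)
rho = spearman_rho(
    [entropy for _, entropy, _ in toy_nodes],
    [float(acc) for _, _, acc in toy_nodes],
)
```

Under the chain assumption, an expected accepted length above 1.0 requires high acceptance at several consecutive depths, because each depth's contribution is discounted by every depth before it.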
