CL LG | Apr 8, 2025

DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding

Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavaram

arXiv:2504.05598v2 | 12.06 citations | h-index: 5 | Has Code

Originality: Incremental advance

AI Analysis

This incremental improvement addresses the computational inefficiency in speculative decoding for LLM inference, benefiting users by reducing inference time without quality loss.

The paper tackles static hyperparameter selection in early-exit (self-)speculative decoding for large language models by introducing DEL, a dynamic method that adaptively chooses the exit layer and speculation length during inference. It reports speedups of 2.16× to 2.62× over auto-regressive decoding and exceeds state-of-the-art speculative decoding methods, which peak at 2.43×, by up to 0.19×.

Speculative Decoding (SD) is a widely used approach to accelerate the inference of large language models (LLMs) without reducing generation quality. It operates by first using a compact model to draft multiple tokens efficiently, followed by parallel verification using the target LLM. This approach leads to faster inference compared to auto-regressive decoding. While there are multiple approaches to create a draft model, one promising approach is to use early-exit methods. These methods draft candidate tokens by using a subset of layers of the primary model and applying the remaining layers for verification, allowing a single model to handle both drafting and verification. While this technique reduces memory usage and computational cost, its performance relies on the choice of the exit layer for drafting and the number of tokens drafted (speculation length) in each SD round. Prior works use hyperparameter exploration to statically select these values. However, our evaluations show that these hyperparameter values are task-specific, and even within a task they are dependent on the current sequence context. We introduce DEL (Dynamic Exit Layer), a plug-and-play method that adaptively selects the exit layer and speculation length during inference. DEL dynamically tracks the token acceptance rate if the tokens are drafted at each layer of an LLM and uses that knowledge to heuristically select the optimal exit layer and speculation length. Our experiments across a broad range of models and downstream tasks show that DEL achieves overall speedups of $2.16\times \sim 2.62\times$ over vanilla auto-regressive decoding and improves upon state-of-the-art SD methods, which peak at $2.43\times$, by up to $0.19\times$. The code is available at https://github.com/hoenza/DEL.
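The selection idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not give DEL's actual heuristic, so everything below is an assumption made for illustration. In particular, the function names, the moving-average tracker, and the cost model (drafting one token costs `exit_layer / num_layers` of a full forward pass, and verification costs one full pass) are hypothetical. The expected-token formula is the standard one for speculative decoding when each drafted token is accepted independently with rate `alpha`.

```python
from typing import Dict, Tuple


def update_acceptance(rate: float, accepted: bool, decay: float = 0.9) -> float:
    """Exponential moving average of a layer's acceptance rate (assumed tracker)."""
    return decay * rate + (1.0 - decay) * (1.0 if accepted else 0.0)


def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per round when gamma tokens are drafted and each
    is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def round_cost(exit_layer: int, num_layers: int, gamma: int) -> float:
    """Assumed cost of one round, in units of a full forward pass:
    gamma partial passes for drafting plus one full pass for verification."""
    return gamma * exit_layer / num_layers + 1.0


def select_exit_and_length(
    acceptance: Dict[int, float], num_layers: int, max_gamma: int = 8
) -> Tuple[int, int, float]:
    """Pick the (exit_layer, gamma) pair with the highest expected tokens per unit cost."""
    best_score, best_layer, best_gamma = -1.0, -1, -1
    for layer, alpha in acceptance.items():
        for gamma in range(1, max_gamma + 1):
            score = expected_tokens(alpha, gamma) / round_cost(layer, num_layers, gamma)
            if score > best_score:
                best_score, best_layer, best_gamma = score, layer, gamma
    return best_layer, best_gamma, best_score


if __name__ == "__main__":
    # Hypothetical per-layer acceptance estimates for a 32-layer model.
    rates = {4: 0.55, 8: 0.75, 16: 0.92}
    layer, gamma, score = select_exit_and_length(rates, num_layers=32)
    print(f"exit layer={layer}, speculation length={gamma}, tokens/cost={score:.2f}")
```

Re-running the selection after each round, with the rates refreshed by `update_acceptance`, is one way the choice could follow the current sequence context rather than staying fixed per task.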
