LayerSkip
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Speculative decoding · first seen Apr 25, 2024
superseded — cited as a baseline and beaten by newer methods
7 papers critique it · 3 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites LayerSkip as a baseline.
“However, self-speculative decoding, which uses the same architecture for both draft and target models, inherently limits speedup.”
— Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design
“The performance of EESD hinges on several factors: the early-exit position (which affects draft speed), the draft accuracy (i.e., token acceptance rate), and the number of drafted tokens per step (draft length). Notably, a trade-off exists that more layers involved in drafting improve the acceptance rate but also increase computational cost.”
— Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding
“Prior works have relied on static configuration of E and γ, selected via offline grid search. This introduces two key limitations. First, the optimal E and γ vary significantly across tasks; configurations tuned for one task often underperform on others.”
— DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding
“However, applying SSD directly to multimodal models is challenging, as deeper layers are often essential for capturing cross-modal interactions. Simply skipping layers and forwarding shallow outputs to the LM Head degrades performance.”
— FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference
“More recent self-contained designs like Self-Speculative Decoding (Self-SD) (Zhang et al., 2024) and LayerSkip (Elhoushi et al., 2024) further attempt to reduce computational redundancy by skipping non-critical layers during inference. While these methods highlight the potential of exploiting structural redundancy within LLMs, they typically rely on offline optimization or fine-tuning to identify task-dependent layer configurations, making them less practical in real-world deployment.”
— SDFP: Speculative Decoding with FIT-Pruned Models for Training-Free and Plug-and-Play LLM Acceleration
“While self-speculation simplifies the deployment pipeline, it often provides limited acceleration.”
— CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs
“All existing self-speculative methods share a common assumption: the model is a homogeneous stack of similar layers, and the drafting strategy consists of skipping or shortcutting some of these layers. This assumption breaks down in hybrid architectures, where layers contain fundamentally different computational components.”
— Component-Aware Self-Speculative Decoding in Hybrid Language Models
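Several of the critiques above turn on the same trade-off: exiting later raises the token acceptance rate but makes each draft step more expensive, and the best exit layer E and draft length γ are picked by offline grid search. A rough way to see this is the standard expected-speedup estimate for speculative decoding under an independent per-token acceptance rate. The sketch below is illustrative only and is not taken from any of the cited papers: the function names are hypothetical, the draft cost is assumed to scale as E divided by the total layer count, compute sharing between the draft and verify stages is ignored, and the acceptance rates in the usage lines are invented placeholders rather than measurements.

```python
from typing import Mapping, Sequence, Tuple


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated speedup over plain autoregressive decoding.

    alpha      -- per-token acceptance rate, assumed i.i.d. (simplification)
    gamma      -- number of tokens drafted per step (draft length)
    cost_ratio -- cost of one draft pass relative to one full-model pass
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if gamma < 1 or cost_ratio <= 0.0:
        raise ValueError("gamma must be >= 1 and cost_ratio must be > 0")
    # Expected tokens produced per draft-and-verify step.
    tokens_per_step = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    # gamma draft passes plus one full verification pass.
    cost_per_step = gamma * cost_ratio + 1.0
    return tokens_per_step / cost_per_step


def best_static_config(
    acceptance_by_exit_layer: Mapping[int, float],
    num_layers: int,
    draft_lengths: Sequence[int],
) -> Tuple[int, int, float]:
    """Grid search over (exit layer E, draft length gamma).

    Assumes draft cost is E / num_layers, which is only a first-order guess.
    """
    best = None
    for exit_layer, alpha in acceptance_by_exit_layer.items():
        cost_ratio = exit_layer / num_layers
        for gamma in draft_lengths:
            speedup = expected_speedup(alpha, gamma, cost_ratio)
            if best is None or speedup > best[2]:
                best = (exit_layer, gamma, speedup)
    if best is None:
        raise ValueError("no configurations to search")
    return best


# Placeholder acceptance rates, invented for illustration only.
toy_acceptance = {8: 0.6, 16: 0.8}
print(best_static_config(toy_acceptance, num_layers=32, draft_lengths=range(1, 9)))
```

Because the acceptance rate at a given exit layer depends on the task, the argmax of this grid shifts when the acceptance table changes, which is the limitation of static configuration that the DEL quote points at.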
Beaten on benchmarks
Head-to-head results where a newer method reports beating LayerSkip. Values are copied from the source paper's tables — verify against the cited paper.
- S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models
SoFT + S2D (ours) beats LayerSkip · Avg Speedup [Target-Independent Baselines (Fine-tuning Vicuna 7b layers 1-12)]
1.55 vs 1.51
- FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference
FastVLM beats LayerSkip · BLEU4 [BLIP2-FlanT5]
43.6 vs 33.4
- FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference
FastVLM beats LayerSkip · Speedup [BLIP2-FlanT5]
1.61 vs 1.25
- FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference
FastVLM beats LayerSkip · BLEU4 [BLIP2-OPT]
43.4 vs 31.9
- FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference
FastVLM beats LayerSkip · Speedup [BLIP2-OPT]
1.75 vs 1.39
- FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference
FastVLM beats LayerSkip · Total (MM-Vet) [LLaVA-1.5-7B]
27.8 vs 25.7
- FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference
FastVLM beats LayerSkip · Speedup [LLaVA-1.5-7B]
1.85 vs 1.71
- FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference
FastVLM beats LayerSkip · MRR [VisDial]
43.9 vs 33.2
- FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference
FastVLM beats LayerSkip · Speedup [NoCaps]
1.55 vs 1.45
- FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference
FastVLM beats LayerSkip · BLEU4 [CLIP-LLAMA]
40.7 vs 27.4
- FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference
FastVLM beats LayerSkip · Speedup [CLIP-LLAMA]
1.77 vs 1.45
- FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference
FastVLM beats LayerSkip · dev [VQAv2]
83.9 vs 75.8
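The speedup rows above differ a lot in how large the margin over LayerSkip actually is. As a small arithmetic aid, the sketch below computes the relative gain for each listed speedup pair. The values are the ones listed on this page (not re-verified against the papers), the helper name is hypothetical, and the comparison assumes that within each row both speedups were measured against the same autoregressive baseline.

```python
def relative_gain(new: float, old: float) -> float:
    """Fractional improvement of `new` over `old` (0.10 means 10% higher)."""
    if old <= 0.0:
        raise ValueError("old must be positive")
    return new / old - 1.0


# (method, setting, reported speedup, LayerSkip speedup), as listed on this page.
speedup_rows = [
    ("SoFT + S2D", "Vicuna 7b, target-independent", 1.55, 1.51),
    ("FastVLM", "BLIP2-FlanT5", 1.61, 1.25),
    ("FastVLM", "BLIP2-OPT", 1.75, 1.39),
    ("FastVLM", "LLaVA-1.5-7B", 1.85, 1.71),
    ("FastVLM", "NoCaps", 1.55, 1.45),
    ("FastVLM", "CLIP-LLAMA", 1.77, 1.45),
]

for method, setting, new, old in speedup_rows:
    gain = relative_gain(new, old)
    print(f"{method} [{setting}]: {new} vs {old} ({gain:+.1%})")
```

Reading the margins as percentages makes it easier to separate near-ties from substantial gaps before checking the cited tables.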
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- May 4, 2026
- component-aware self-speculative decoding · Component-Aware Self-Speculative Decoding in Hybrid Language Models · May 1, 2026
- Apr 22, 2026
- Apr 16, 2026
- Apr 2, 2026
- greedy multi-path block verification (GBV) · Greedy Multi-Path Block Verification for Faster Decoding in Speculative Sampling · Feb 18, 2026
- SDFP · SDFP: Speculative Decoding with FIT-Pruned Models for Training-Free and Plug-and-Play LLM Acceleration · Feb 5, 2026
- Feb 1, 2026
- CAS-Spec (Cascade Adaptive Self-Speculative Decoding) · CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs · Oct 30, 2025
- Oct 26, 2025
- Oct 17, 2025
- Oct 1, 2025