CL · May 22, 2025

KNN-SSD: Enabling Dynamic Self-Speculative Decoding via Nearest Neighbor Layer Set Optimization

arXiv:2505.16162v1 · 7 citations · h-index: 10
Originality: Incremental advance
AI Analysis

This work addresses domain generalizability for LLM inference acceleration, offering an incremental improvement over existing self-speculative decoding methods.

The paper tackles the problem of domain sensitivity in Self-Speculative Decoding for LLM inference acceleration by introducing KNN-SSD, which uses KNN search to optimize skipped layers, resulting in a 1.3x-1.6x speedup.

Speculative Decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs) without compromising generation quality. It works by efficiently drafting multiple tokens using a compact model and then verifying them in parallel using the target LLM. Notably, Self-Speculative Decoding proposes skipping certain layers to construct the draft model, which eliminates the need for additional parameters or training. Despite its strengths, we observe in this work that drafting with layer skipping exhibits significant sensitivity to domain shifts, leading to a substantial drop in acceleration performance. To enhance the domain generalizability of this paradigm, we introduce KNN-SSD, an algorithm that leverages K-Nearest Neighbor (KNN) search to match different skipped layers with various domain inputs. We evaluated our algorithm in various models and multiple tasks, observing that its application leads to 1.3x-1.6x speedup in LLM inference.
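The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how inputs are embedded, how the skipped-layer sets are obtained, or how neighbors are aggregated. The sketch assumes each stored anchor is a pair of an input embedding and a previously chosen skipped-layer set, uses cosine similarity as the distance, and takes a majority vote over the k nearest anchors. The names `cosine`, `select_skip_set`, and `ANCHORS`, and all numeric values, are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_skip_set(query, anchors, k=3):
    """Return the skipped-layer set favored by the k anchors most similar to `query`.

    `anchors` is a list of (embedding, skip_set) pairs, where skip_set is a
    tuple of layer indices to skip when drafting. Assumption: majority vote
    among the k nearest anchors; the paper may aggregate differently.
    """
    ranked = sorted(anchors, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(skip_set for _, skip_set in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical anchors: toy 2-D embeddings for two domains, each with its own
# skipped-layer set. Real embeddings and layer indices would come from the model.
ANCHORS = [
    ([1.0, 0.1], (2, 5, 9)),
    ([0.9, 0.2], (2, 5, 9)),
    ([0.1, 1.0], (3, 7, 11)),
    ([0.2, 0.9], (3, 7, 11)),
]

# A new input whose embedding lies near the first domain's anchors.
chosen = select_skip_set([0.95, 0.15], ANCHORS, k=3)
print(chosen)
```

In this toy setup, the draft model for the new input would skip the layers in `chosen`, and the full model would then verify the drafted tokens in parallel as in standard speculative decoding.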

