CL · AI · Apr 10, 2025

SD$^2$: Self-Distilled Sparse Drafters

arXiv:2504.08838v22 citations
h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses inference efficiency for LLM users, offering incremental improvements in draft model compression and alignment.

The paper tackles the latency of Large Language Models (LLMs) under speculative decoding by introducing Self-Distilled Sparse Drafters (SD$^2$). On a Llama-3.1-70B target model, SD$^2$ achieves a 1.59× higher Mean Accepted Length than layer-pruned draft models and cuts Multiply-Accumulate operations by over 43.87% relative to a dense draft model, at the cost of an 8.36% reduction in Mean Accepted Length.

Speculative decoding is a powerful technique for reducing the latency of Large Language Models (LLMs), offering a fault-tolerant framework that enables the use of highly compressed draft models. In this work, we introduce Self-Distilled Sparse Drafters (SD$^2$), a novel methodology that leverages self-data distillation and fine-grained weight sparsity to produce highly efficient and well-aligned draft models. SD$^2$ systematically enhances draft token acceptance rates while significantly reducing Multiply-Accumulate operations (MACs), even in the Universal Assisted Generation (UAG) setting, where draft and target models originate from different model families. On a Llama-3.1-70B target model, SD$^2$ provides a 1.59$\times$ higher Mean Accepted Length (MAL) compared to layer-pruned draft models and reduces MACs by over 43.87% with an 8.36% reduction in MAL compared to a dense draft model. Our 1.5B and 3B unstructured sparse drafters outperform both dense and layer-pruned models in terms of end-to-end latency improvements, highlighting the potential of sparsity-aware fine-tuning and compression strategies to improve LLM inference efficiency while maintaining alignment with target models.
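
The two metrics the abstract reports, MAL and MACs, can be made concrete with a minimal sketch. It is not the authors' implementation. It assumes greedy verification (a draft token is accepted only if it equals the target model's own choice at that position) and the convention that each verification round also yields one token from the target, so a round contributes its matched prefix length plus one. The paper may define MAL differently. The function names and the toy token IDs are hypothetical.

```python
from typing import List, Sequence, Tuple


def accepted_prefix(draft: Sequence[int], target: Sequence[int]) -> int:
    """Number of leading draft tokens that match the target's greedy tokens."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break  # first mismatch ends the accepted prefix
        n += 1
    return n


def mean_accepted_length(rounds: List[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Average tokens produced per verification round.

    Assumption: each round yields the accepted draft prefix plus one token
    supplied by the target model itself.
    """
    lengths = [accepted_prefix(d, t) + 1 for d, t in rounds]
    return sum(lengths) / len(lengths)


def sparse_linear_macs(in_features: int, out_features: int, sparsity: float) -> float:
    """MACs for one token through a linear layer whose weights are
    unstructured-sparse, counting only the nonzero weights."""
    return in_features * out_features * (1.0 - sparsity)


# Toy data (made up): (draft tokens, target's greedy tokens) per round.
toy_rounds = [([1, 2, 3, 4], [1, 2, 9, 4]), ([5, 6, 7], [5, 6, 7]), ([8], [0])]
print(mean_accepted_length(toy_rounds))
print(sparse_linear_macs(4096, 4096, 0.5))
```

Under these assumptions, a sparser drafter lowers the MAC count per drafted token, while any loss in alignment with the target shows up as a shorter accepted prefix. The abstract's trade-off (over 43.87% fewer MACs for an 8.36% lower MAL) is a statement about exactly these two quantities.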

