CL · May 13, 2024

EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models

arXiv:2405.07542v2 · 14 citations · h-index: 32 · Has Code · NAACL
Originality: Incremental advance
AI Analysis

This work addresses a practical bottleneck in batched LLM inference, the padding overhead of multi-sample speculative decoding, though it is an incremental advance within speculative decoding techniques.

The paper tackles inconsistent token acceptance in multi-sample speculative decoding for Large Language Models: when samples in a batch accept different numbers of tokens, padding is added, and that overhead reduces the speedup. It proposes a method that eliminates padding while maintaining computational efficiency, achieving competitive speedup ratios.

Speculative decoding has emerged as a pivotal technique for enhancing the inference speed of Large Language Models (LLMs). Despite recent research aiming to improve prediction efficiency, multi-sample speculative decoding has been overlooked because the number of accepted tokens varies across the samples in a batch during the verification phase. The vanilla method adds padding tokens to ensure that the number of new tokens remains consistent across samples. However, this increases the computational and memory access overhead, thereby reducing the speedup ratio. We propose a novel method that resolves the issue of different samples accepting different numbers of tokens without increasing memory or computing overhead. Furthermore, our proposed method can handle the situation where the prediction tokens of different samples are inconsistent without the need to add padding tokens. Sufficient experiments demonstrate the efficacy of our method. Our code is available at https://github.com/niyunsheng/EMS-SD.
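
To make the padding cost described in the abstract concrete, here is a minimal sketch that only counts tokens. It assumes a batch in which each sample accepts a different number of draft tokens in one verification step. The function names are hypothetical and are not taken from the EMS-SD repository, and the sketch does not implement the paper's method. It only contrasts padding every sample to the batch maximum with tracking each sample's length separately.

```python
from typing import List


def padded_token_count(accepted: List[int]) -> int:
    """Tokens stored per step if every sample is padded to the batch maximum."""
    return max(accepted) * len(accepted) if accepted else 0


def unpadded_token_count(accepted: List[int]) -> int:
    """Tokens stored per step if each sample keeps only what it accepted."""
    return sum(accepted)


def padding_overhead(accepted: List[int]) -> int:
    """Number of padding tokens the padded scheme adds in one step."""
    return padded_token_count(accepted) - unpadded_token_count(accepted)


def advance_lengths(seq_lens: List[int], accepted: List[int]) -> List[int]:
    """Per-sample bookkeeping: each sequence grows by its own accepted count."""
    if len(seq_lens) != len(accepted):
        raise ValueError("seq_lens and accepted must have the same length")
    return [n + a for n, a in zip(seq_lens, accepted)]


if __name__ == "__main__":
    accepted = [1, 4, 2, 3]  # assumed accepted-token counts for a batch of 4
    print(padded_token_count(accepted))    # 16
    print(unpadded_token_count(accepted))  # 10
    print(padding_overhead(accepted))      # 6
    print(advance_lengths([10, 10, 10, 10], accepted))  # [11, 14, 12, 13]
```

In this assumed batch, padding stores 16 tokens where 10 would suffice, so 6 of the 16 slots carry no information. This is the overhead the abstract attributes to the vanilla method.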

Code Implementations: 1 repo