CL, Oct 18, 2023

SPEED: Speculative Pipelined Execution for Efficient Decoding

arXiv:2310.12072v2
51 citations
h-index: 97
Originality: Incremental advance
AI Analysis

The paper addresses slow real-time inference for users of large language models, though it is an incremental improvement on existing speculative execution techniques.

The paper tackles the high inference latency of autoregressive generative LLMs by proposing SPEED, a method that speculatively executes multiple future tokens in parallel using early-layer hidden states, achieving latency reduction while maintaining model accuracy.

Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios has been highly restricted due to the significant inference latency associated with these models. This is particularly pronounced due to the autoregressive nature of generative LLM inference, where tokens are generated sequentially since each token depends on all previous output tokens. It is therefore challenging to achieve any token-level parallelism, making inference extremely memory-bound. In this work, we propose SPEED, which improves inference efficiency by speculatively executing multiple future tokens in parallel with the current token using predicted values based on early-layer hidden states. For Transformer decoders that employ parameter sharing, the memory operations for the tokens executing in parallel can be amortized, which allows us to accelerate generative LLM inference. We demonstrate the efficiency of our method in terms of latency reduction relative to model accuracy and demonstrate how speculation allows for training deeper decoders with parameter sharing with minimal runtime overhead.
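The speculation idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the "model" below is a made-up integer recurrence, the names `shared_layer`, `forward`, `EARLY_LAYER`, and `speculative_decode` are hypothetical, and the speculative work runs sequentially here, whereas a real system would overlap it with the current token's remaining layers. It shows only the draft-then-verify logic: guess the current token from an early layer, start the next token from that guess, and keep the speculative result only if the guess matches the full-depth output.

```python
VOCAB = 16        # toy vocabulary size (assumption)
NUM_LAYERS = 6    # depth of the toy decoder (assumption)
EARLY_LAYER = 2   # layer whose state is used for the early guess (assumption)


def shared_layer(h, ctx):
    """One weight-shared 'layer': the same function is applied at every depth."""
    return (5 * h + ctx + 3) % 97


def forward(prefix, depth):
    """Run `depth` shared layers for the next position and read out a token id."""
    # The context depends on every previous token, mimicking autoregression.
    ctx = sum((i + 1) * t for i, t in enumerate(prefix)) % 97
    h = prefix[-1]
    for _ in range(depth):
        h = shared_layer(h, ctx)
    return h % VOCAB


def greedy_decode(prompt, n):
    """Baseline: one full-depth pass per generated token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(forward(seq, NUM_LAYERS))
    return seq


def speculative_decode(prompt, n):
    """Draft from an early layer, then verify against the full-depth output."""
    seq = list(prompt)
    target = len(prompt) + n
    accepted = 0
    while len(seq) < target:
        guess = forward(seq, EARLY_LAYER)                # early-layer prediction
        spec_next = forward(seq + [guess], NUM_LAYERS)   # speculative next token
        actual = forward(seq, NUM_LAYERS)                # full-depth result
        seq.append(actual)
        # Keep the speculative token only if it was built on a correct guess.
        if guess == actual and len(seq) < target:
            seq.append(spec_next)
            accepted += 1
    return seq, accepted


if __name__ == "__main__":
    prompt = [3, 7, 1]
    seq, accepted = speculative_decode(prompt, 12)
    print(seq == greedy_decode(prompt, 12), accepted)
```

Because a speculative token is kept only when the early guess equals the full-depth token, the output matches plain greedy decoding for this toy model. Any benefit comes from how often the guess is right and how cheaply the speculative work overlaps with the current token.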
