CL · AI · LG · Feb 19, 2024

Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding

arXiv:2402.11809v3 · 38 citations · h-index: 3 · ACL
Originality: Incremental advance
AI Analysis

This work addresses the high computational cost of LLM inference with a lossless acceleration method. The advance is incremental: it builds on existing speculative decoding and semi-autoregressive techniques.

This research tackles slow inference in large language models (LLMs) by proposing SPACE, a method that achieves speedups of 2.7x to 4.0x on HumanEval-X while maintaining output quality.

This research aims to accelerate the inference speed of large language models (LLMs) with billions of parameters. We propose Smart Parallel Auto-Correct dEcoding (SPACE), an innovative approach designed for achieving lossless acceleration of LLMs. By integrating semi-autoregressive inference and speculative decoding capabilities, SPACE uniquely enables autoregressive LLMs to parallelize token generation and verification. This is realized through a specialized semi-autoregressive supervised fine-tuning process that equips existing LLMs with the ability to simultaneously predict multiple tokens. Additionally, an auto-correct decoding algorithm facilitates the simultaneous generation and verification of token sequences within a single model invocation. Through extensive experiments on a range of LLMs, SPACE has demonstrated inference speedup ranging from 2.7x-4.0x on HumanEval-X while maintaining output quality.
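The draft-then-verify idea behind lossless acceleration can be illustrated with a toy sketch. This is not the SPACE implementation: in SPACE a single fine-tuned model both proposes and verifies tokens, whereas here a separate toy guesser stands in for the drafting step. All names (`target_next`, `draft_next`, `draft_and_verify_decode`) and the arithmetic "model" are hypothetical, chosen only to show why accepting the longest verified prefix reproduces plain greedy decoding exactly while using fewer verification calls.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption for illustration)


def target_next(prefix: List[int]) -> int:
    """Toy stand-in for an LLM's greedy next-token prediction."""
    return (31 * sum(prefix) + 7 * len(prefix) + 3) % VOCAB


def draft_next(prefix: List[int]) -> int:
    """Toy cheap guesser: deliberately wrong whenever len(prefix) % 3 == 0."""
    guess = target_next(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % VOCAB


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def draft_and_verify_decode(
    prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Propose k draft tokens, verify them together, keep the verified prefix."""
    out = list(prompt)
    target_len = len(prompt) + n_new
    invocations = 0
    while len(out) < target_len:
        # Drafting step: guess k tokens ahead.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))

        # Verification step: conceptually ONE parallel forward pass that
        # yields the greedy token at each of the k + 1 positions.
        invocations += 1
        verified = [target_next(out + draft[:i]) for i in range(k + 1)]

        # Accept the longest prefix of drafts that matches the verifier.
        n_ok = 0
        while n_ok < k and draft[n_ok] == verified[n_ok]:
            n_ok += 1
        out.extend(draft[:n_ok])
        # The verifier's own token at the first mismatch (or after all
        # k accepted drafts) is always correct, so progress is guaranteed.
        out.append(verified[n_ok])
    return out[:target_len], invocations


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast, calls = draft_and_verify_decode(prompt, n_new=20, k=4)
    assert fast == greedy_decode(prompt, n_new=20)  # lossless
    print(f"verification calls: {calls} vs 20 sequential calls")
```

Because every accepted token equals what greedy decoding would have produced at that position, the output is identical to the baseline; any speedup comes only from how many drafts survive verification per call.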

Code Implementations: 2 repos
