CL | AI | LG | Jan 23, 2024

BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models

arXiv:2401.12522v2 | 16 citations | h-index: 3 | Expert Syst. Appl.
Originality: Incremental advance
AI Analysis

This work addresses LLM inference latency for users who need faster generation, but the advance is incremental because it builds on existing acceleration techniques.

The paper tackles the inefficiency of autoregressive generation in large language models (LLMs) by proposing BiTA, a method that uses bi-directional tuning for semi-autoregressive generation and draft verification, achieving a 2.7x speedup on LLaMA-2-70B-Chat on the MT-Bench benchmark.

Large language models (LLMs) commonly employ autoregressive generation during inference, leading to high memory bandwidth demand and consequently extended latency. To mitigate this inefficiency, we present Bi-directional Tuning for lossless Acceleration (BiTA), an innovative method expediting LLMs via streamlined semi-autoregressive generation and draft verification. Inspired by the concept of prompt tuning, we enhance LLMs with a parameter-efficient design called bi-directional tuning for the capability in semi-autoregressive generation. Employing efficient tree-based decoding, the models perform draft candidate generation and verification in parallel, ensuring outputs identical to their autoregressive counterparts under greedy sampling. BiTA serves as a lightweight plug-in module, seamlessly boosting the inference efficiency of existing LLMs without requiring additional assistance models or incurring significant extra memory costs. Applying the proposed BiTA, LLaMA-2-70B-Chat achieves a 2.7$\times$ speedup on the MT-Bench benchmark. Extensive experiments confirm our method surpasses state-of-the-art acceleration techniques.
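The lossless property claimed in the abstract (outputs identical to autoregressive decoding under greedy sampling) can be illustrated with a minimal draft-and-verify sketch. This is not BiTA's implementation: `toy_next_token` is a hypothetical stand-in for an LLM's greedy next-token choice, `noisy_drafter` is a hypothetical drafter that deliberately gets its third guess wrong, and the draft is a single linear sequence rather than the tree of candidates the paper describes. Verification is written as a loop for readability; a real system would score all draft positions in one parallel forward pass.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for the target model's greedy (argmax) next token.
def toy_next_token(context: Sequence[int]) -> int:
    return (sum(context) * 31 + len(context)) % 50

def greedy_decode(next_token: Callable[[Sequence[int]], int],
                  prompt: Sequence[int], n_new: int) -> List[int]:
    """Plain autoregressive greedy decoding: one token per model call."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out[len(prompt):]

def verify_draft(next_token: Callable[[Sequence[int]], int],
                 context: Sequence[int], draft: Sequence[int]) -> List[int]:
    """Accept the longest draft prefix that matches the greedy choices,
    then append one token chosen by the target model itself."""
    ctx = list(context)
    accepted: List[int] = []
    for tok in draft:
        target = next_token(ctx)
        if tok != target:
            accepted.append(target)   # replace the first wrong guess
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(next_token(ctx))  # whole draft accepted: bonus token
    return accepted

# Hypothetical drafter: correct on every guess except the third.
def noisy_drafter(context: Sequence[int], k: int) -> List[int]:
    ctx = list(context)
    draft: List[int] = []
    for i in range(k):
        tok = toy_next_token(ctx)
        if i == 2:
            tok = (tok + 1) % 50      # deliberately wrong
        draft.append(tok)
        ctx.append(tok)
    return draft

def draft_and_verify_decode(next_token, drafter, prompt: Sequence[int],
                            n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Decode n_new tokens; returns the tokens and the number of verify steps."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_new:
        draft = drafter(out, k)
        out.extend(verify_draft(next_token, out, draft))
        steps += 1
    return out[len(prompt):len(prompt) + n_new], steps

prompt = [1, 2, 3]
reference = greedy_decode(toy_next_token, prompt, 12)
fast, steps = draft_and_verify_decode(toy_next_token, noisy_drafter, prompt, 12)
print(fast == reference, steps)
```

Because every emitted token is either a draft token that equals the greedy choice or the greedy choice itself, the result cannot differ from plain greedy decoding; any speedup comes only from needing fewer verification steps than tokens.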

Code Implementations: 1 repo
