A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models
This survey gives researchers and practitioners a comprehensive overview of methods for improving efficiency in real-time LLM applications, but its contribution is incremental: it summarizes existing methods without introducing new techniques.
This paper surveys techniques for accelerating generation in large language models, where sequential decoding causes high inference latency, and categorizes them into speculative decoding, early exiting, and non-autoregressive methods to guide future research.
Accelerating text generation in large language models (LLMs) is crucial for producing content efficiently, yet the sequential nature of autoregressive generation often leads to high inference latency, posing challenges for real-time applications. Various techniques have been proposed and developed to address these challenges and improve efficiency. This paper presents a comprehensive survey of accelerated generation techniques in autoregressive language models, aiming to understand the state-of-the-art methods and their applications. We categorize these techniques into several key areas: speculative decoding, early exiting mechanisms, and non-autoregressive methods. We discuss each category's underlying principles, advantages, limitations, and recent advancements. Through this survey, we aim to offer insights into the current landscape of accelerated generation techniques in LLMs and provide guidance for future research directions in this critical area of natural language processing.
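To make the first category concrete, the following is a minimal sketch of greedy speculative decoding, not the survey's own formulation. It assumes toy stand-in models: `target_model` and `draft_model` are hypothetical functions that map a token prefix to the next token, and the draft is deliberately wrong at some positions. A cheap draft proposes up to `k` tokens, the expensive target checks them, and the output matches plain greedy decoding with the target. Real systems verify all proposed positions in one batched forward pass; here that is only simulated by counting one "target pass" per verification round.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: token prefix -> next token (greedy choice).
NextToken = Callable[[List[int]], int]

def target_model(prefix: List[int]) -> int:
    """Toy stand-in for the large, slow model."""
    return (prefix[-1] * 3 + 1) % 11

def draft_model(prefix: List[int]) -> int:
    """Toy stand-in for the small, fast model; wrong when len(prefix) % 4 == 0."""
    guess = (prefix[-1] * 3 + 1) % 11
    return guess if len(prefix) % 4 else (guess + 1) % 11

def autoregressive_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one sequential target call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens

def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verification rounds)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1) The draft proposes up to k tokens sequentially (assumed cheap).
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2) The target verifies the proposal. Counted as one pass because a real
        #    model would score all proposed positions in a single forward call.
        target_passes += 1
        for guess in proposal:
            expected = target(tokens)  # tokens already holds the accepted prefix
            tokens.append(expected)    # always keep the target's own choice
            if expected != guess:
                break                  # first mismatch: drop the remaining draft tokens
    return tokens, target_passes

prompt = [2]
baseline = autoregressive_decode(target_model, prompt, n_new=12)
speculative, passes = speculative_decode(target_model, draft_model, prompt, n_new=12, k=4)
print(speculative == baseline, passes)
```

In this toy setup the speculative output is identical to the baseline while needing fewer verification rounds than generated tokens. The sketch omits the extra "bonus" token that some formulations take from the target when every draft token is accepted, and it covers only greedy decoding, not sampling-based acceptance rules.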