CL Jan 31, 2025

Efficient Beam Search for Large Language Models Using Trie-Based Decoding

Brian J Chan, MaoXun Huang, Jui-Hung Cheng, Chao-Ting Chen, Hen-Hsen Huang

arXiv:2502.00085v214.711 citations h-index: 2

Originality Incremental advance

AI Analysis

The paper addresses memory constraints in deploying large language models in resource-limited environments, though the contribution appears incremental because it optimizes an existing decoding approach.

This paper tackles the memory inefficiency of batch-based beam search in large language models by introducing a trie-based parallel decoding method that shares a single KV cache across beams with common prefixes, resulting in 4-8× memory savings and up to 2.4× faster decoding without quality loss.

This work presents a novel trie (prefix-tree)-based parallel decoding method that addresses the memory inefficiency of batch-based beam search. By sharing a single KV cache across beams with common prefixes, our approach dramatically reduces memory usage and enables efficient decoding. We evaluated our method across three attention architectures, Multi-Head Attention (Phi-3.5-mini-instruct), Grouped Query Attention (Llama-3.1-8B-Instruct), and Sliding Window Attention (Mistral-Small-24B-Instruct-2501), using CNN/DailyMail for abstractive summarization and HumanEval for code generation. Our experiments demonstrate substantial memory savings (4--8$\times$) and up to 2.4$\times$ faster decoding, without compromising generation quality. These results highlight our method's suitability for memory-constrained environments and large-scale deployments.
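The prefix-sharing idea in the abstract can be pictured with a small sketch. It is not the authors' implementation: the names `TrieNode`, `BeamTrie`, `batch_cache_entries`, and `path_tokens` are hypothetical, and each trie node only stands in for one cached token position, where a real system would hold key/value tensors and use an attention mask to restrict each beam to its own ancestors. The sketch counts how many cache entries a batch-style layout (one full copy per beam) would need compared with a trie in which shared prefixes are stored once.

```python
from typing import Dict, List, Optional


class TrieNode:
    """One token position; stands in for a single KV cache entry (assumption)."""

    def __init__(self, token: int, parent: Optional["TrieNode"] = None) -> None:
        self.token = token
        self.parent = parent
        self.children: Dict[int, "TrieNode"] = {}


class BeamTrie:
    """Stores beams so that common prefixes share the same nodes."""

    def __init__(self) -> None:
        self.root = TrieNode(token=-1)  # sentinel root, holds no real token
        self.num_nodes = 0              # token positions that need a cache entry

    def insert(self, tokens: List[int]) -> TrieNode:
        """Add one beam and return its leaf; existing prefix nodes are reused."""
        node = self.root
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                child = TrieNode(tok, parent=node)
                node.children[tok] = child
                self.num_nodes += 1
            node = child
        return node


def batch_cache_entries(beams: List[List[int]]) -> int:
    """Entries needed when every beam keeps its own full copy of the cache."""
    return sum(len(beam) for beam in beams)


def path_tokens(leaf: TrieNode) -> List[int]:
    """Tokens a beam may attend to: its ancestors in the trie, root to leaf."""
    tokens: List[int] = []
    node: Optional[TrieNode] = leaf
    while node is not None and node.parent is not None:
        tokens.append(node.token)
        node = node.parent
    return tokens[::-1]


# Three toy beams (made-up token ids) sharing the prefix [1, 2].
beams = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 6, 7]]
trie = BeamTrie()
leaves = [trie.insert(beam) for beam in beams]

print("batch entries:", batch_cache_entries(beams))
print("trie entries:", trie.num_nodes)
print("third beam attends to:", path_tokens(leaves[2]))
```

In this toy case the saving comes entirely from the shared prefix, so the ratio depends on how much the beams overlap; the 4--8$\times$ figure reported in the abstract is an experimental result and is not reproduced by this sketch.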
