CL, LG | Oct 30, 2025

Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models

arXiv:2510.26577v1 | 2 citations
Originality: Incremental advance
AI Analysis

The paper addresses inference efficiency for LLM users by improving decoding speed. The advance is incremental because it builds on existing speculative decoding methods.

The paper tackles inference latency in Large Language Models by proposing CAST, a dynamic tree decoding approach that accounts for inference costs such as GPU configuration and batch size. It reports speedups of up to 5.2x over conventional decoding and generally outperforms state-of-the-art techniques by 5% to 20%.

Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size. To address this, speculative decoding has emerged as a solution, enabling multiple tokens to be generated and validated simultaneously. While recent approaches such as EAGLE-2 and EAGLE-3 improve speculative decoding using dynamic tree structures, they often neglect the impact of crucial system variables such as GPU devices and batch sizes. Therefore, we introduce a new dynamic tree decoding approach called CAST that takes inference costs into account, including factors such as GPU configurations and batch sizes, to dynamically refine the tree structure. In comprehensive experiments across six diverse tasks and six distinct LLMs, our method achieves speedups of up to 5.2x over conventional decoding methods. Moreover, it generally outperforms existing state-of-the-art techniques by 5% to 20%.
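The central point, that the best speculation budget depends on what verification costs on a given system, can be illustrated with a toy model. The sketch below is not the CAST algorithm: it uses a single draft chain instead of a tree, and the latency model, the function names, and every constant (`accept`, `draft_cost`, `base`, `per_token`, `free_tokens`) are assumptions chosen only for illustration.

```python
def expected_tokens(k: int, accept: float) -> float:
    """Expected tokens emitted per verification step for a chain of k draft tokens.

    Assumes each draft token is accepted independently with probability
    `accept`; the sum 1 + a + ... + a^k includes the token the target model
    always contributes.
    """
    return sum(accept ** i for i in range(k + 1))


def step_cost(k: int, batch: int, draft_cost: float, base: float,
              per_token: float, free_tokens: int) -> float:
    """Hypothetical latency of one draft-then-verify step.

    Assumption: the target forward pass costs `base` while the number of
    verified tokens (batch * (k + 1)) stays under `free_tokens`, then grows
    linearly with the excess. Drafting costs `draft_cost` per draft token.
    """
    tokens = batch * (k + 1)
    extra = max(0, tokens - free_tokens)
    return k * draft_cost + base + per_token * extra


def best_draft_length(batch: int, accept: float = 0.8, max_k: int = 16,
                      draft_cost: float = 0.05, base: float = 1.0,
                      per_token: float = 0.02, free_tokens: int = 64) -> int:
    """Pick the draft length that maximizes expected tokens per unit of latency."""
    def throughput(k: int) -> float:
        cost = step_cost(k, batch, draft_cost, base, per_token, free_tokens)
        return expected_tokens(k, accept) / cost

    return max(range(max_k + 1), key=throughput)


if __name__ == "__main__":
    for batch in (1, 8, 32):
        print(f"batch={batch}: best draft length = {best_draft_length(batch)}")
```

Under these assumed constants, the preferred draft length shrinks as the batch grows, because verification stops being effectively free once the batch saturates the device. A fixed speculation budget tuned for one batch size would therefore be mistuned for another, which is the kind of system dependence the abstract argues should inform tree construction.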
