LG, Jan 30

TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification

Haoyun Jiang, Junqi He, Feng Hong, Xinlong Yang, Jianwei Zhang, Zheng Li, Zhengyang Zhuge, Zhiyong Chen, Bo Han, Junyang Lin, Jiangchao Yao

arXiv:2601.23180v1, 2.71 citations, h-index: 15

Originality: Incremental advance

AI Analysis

TriSpec targets inference efficiency for LLM users by reducing the verification cost of speculative decoding, though it builds incrementally on existing methods.

The paper tackles the verification cost bottleneck in speculative decoding for LLMs by proposing TriSpec, a ternary framework that uses a lightweight proxy to reduce target model invocations. Experiments show up to 35% speedup over standard speculative decoding, with up to 50% fewer target model calls, while maintaining comparable accuracy.

Inference efficiency in Large Language Models (LLMs) is fundamentally limited by their serial, autoregressive generation, especially as reasoning becomes a key capability and response sequences grow longer. Speculative decoding (SD) offers a powerful solution, providing significant speed-ups through its lightweight drafting and parallel verification mechanism. While existing work has nearly saturated improvements in draft effectiveness and efficiency, this paper advances SD from a new yet critical perspective: the verification cost. We propose TriSpec, a novel ternary SD framework that, at its core, introduces a lightweight proxy to significantly reduce computational cost by approving easily verifiable draft sequences and engaging the full target model only when encountering uncertain tokens. TriSpec can be integrated with state-of-the-art SD methods like EAGLE-3 to further reduce verification costs, achieving greater acceleration. Extensive experiments on the Qwen3 and DeepSeek-R1-Distill-Qwen/LLaMA families show that TriSpec achieves up to 35% speedup over standard SD, with up to 50% fewer target model invocations while maintaining comparable accuracy.
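The three-way control flow described in the abstract (draft, proxy check, target only on uncertain tokens) can be illustrated with a toy sketch. This is not the paper's algorithm: the confidence threshold `tau`, the rule that a confident proxy may accept or correct a draft token on its own, the sequential (rather than parallel) verification loop, and all names (`ternary_step`, `generate`, `Stats`, the toy `draft`/`proxy`/`target` functions) are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Token = int
# Hypothetical model interface: context -> (greedy next token, confidence in [0, 1]).
Model = Callable[[List[Token]], Tuple[Token, float]]


@dataclass
class Stats:
    target_calls: int = 0  # how often the expensive target model was consulted


def ternary_step(context: List[Token], draft: Model, proxy: Model, target: Model,
                 k: int, tau: float, stats: Stats) -> List[Token]:
    """One draft-then-verify round; returns the tokens to append (at least one)."""
    # 1. The drafter proposes k tokens autoregressively.
    drafted: List[Token] = []
    for _ in range(k):
        tok, _ = draft(context + drafted)
        drafted.append(tok)

    # 2. Verify position by position (sequential here for clarity only).
    accepted: List[Token] = []
    for tok in drafted:
        p_tok, p_conf = proxy(context + accepted)
        if p_conf >= tau:
            # Assumed rule: a confident proxy decides without the target.
            if p_tok == tok:
                accepted.append(tok)
                continue
            accepted.append(p_tok)  # proxy's correction ends the round
            return accepted
        # Proxy is uncertain: escalate this position to the target model.
        stats.target_calls += 1
        t_tok, _ = target(context + accepted)
        accepted.append(t_tok)
        if t_tok != tok:
            return accepted  # draft rejected; remaining draft tokens are discarded
    return accepted


def generate(prompt: List[Token], n_new: int, draft: Model, proxy: Model,
             target: Model, k: int = 4, tau: float = 0.9) -> Tuple[List[Token], Stats]:
    stats = Stats()
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out += ternary_step(out, draft, proxy, target, k, tau, stats)
    return out[:len(prompt) + n_new], stats


# Toy stand-ins for the three models: the "language" counts upward modulo 10.
def target(ctx: List[Token]) -> Tuple[Token, float]:
    return (ctx[-1] + 1) % 10, 1.0


def draft(ctx: List[Token]) -> Tuple[Token, float]:
    # Deliberately wrong whenever the previous token is 7.
    return (0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10), 1.0


def proxy(ctx: List[Token]) -> Tuple[Token, float]:
    # Always agrees with the target, but is unsure after multiples of 3.
    return (ctx[-1] + 1) % 10, (0.5 if ctx[-1] % 3 == 0 else 0.99)


if __name__ == "__main__":
    tokens, stats = generate([0], 12, draft, proxy, target)
    print(tokens, stats.target_calls)
```

In this toy setup the proxy never disagrees with the target, so the generated sequence matches target-only greedy decoding while the target is consulted at only a fraction of the positions. With a real proxy, the threshold would trade target invocations against fidelity to the target's output.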
