LG · Jan 14

Hidden States as Early Signals: Step-level Trace Evaluation and Pruning for Efficient Test-Time Scaling

Zhixiang Liang, Beichen Huang, Zheng Wang, Minjia Zhang

arXiv:2601.09093v1 · 7.56 citations · h-index: 1 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the inference cost of reasoning tasks for users of large language models, though it is an incremental improvement over existing pruning methods.

The paper tackles the high computational cost and latency of test-time scaling in large language models by proposing STEP, a pruning framework that scores reasoning steps from hidden states and dynamically prunes unpromising traces during generation. It reports a 45%-70% average reduction in end-to-end inference latency relative to self-consistency, together with improved accuracy.

Large Language Models (LLMs) can enhance reasoning capabilities through test-time scaling by generating multiple traces. However, the combination of lengthy reasoning traces with multiple sampling introduces substantial computation and high end-to-end latency. Prior work on accelerating this process has relied on similarity-based or confidence-based pruning, but these signals do not reliably indicate trace quality. To address these limitations, we propose STEP: Step-level Trace Evaluation and Pruning, a novel pruning framework that evaluates reasoning steps using hidden states and dynamically prunes unpromising traces during generation. We train a lightweight step scorer to estimate trace quality, and design a GPU memory-aware pruning strategy that triggers pruning as the GPU memory is saturated by KV cache to reduce end-to-end latency. Experiments across challenging reasoning benchmarks demonstrate that STEP reduces end-to-end inference latency by 45%-70% on average compared to self-consistency while also improving reasoning accuracy. Our code is released at: https://github.com/Supercomputing-System-AI-Lab/STEP
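The pruning loop described in the abstract can be illustrated with a small sketch. This is not the released STEP code: the `Trace` container, the mean aggregation of step scores, the token-count stand-in for GPU memory, and the function names are all assumptions made for illustration. It only shows the general shape of the idea, which is to score traces step by step, drop the weakest ones once the KV cache exceeds a budget, and take a majority vote over the survivors as in self-consistency.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Trace:
    """One sampled reasoning trace (hypothetical container, not from the STEP codebase)."""
    trace_id: int
    kv_tokens: int = 0  # tokens this trace currently holds in the KV cache
    step_scores: List[float] = field(default_factory=list)  # one scorer output per finished step
    answer: Optional[str] = None  # final answer, once the trace has finished

    @property
    def quality(self) -> float:
        # Assumption: aggregate step scores by their mean; the paper may aggregate differently.
        if not self.step_scores:
            return 0.0
        return sum(self.step_scores) / len(self.step_scores)


def prune_if_saturated(traces: List[Trace], kv_budget_tokens: int) -> List[Trace]:
    """Drop the lowest-quality traces until the total KV usage fits the budget.

    The token budget is a stand-in for "GPU memory saturated by KV cache";
    at least one trace is always kept.
    """
    alive = sorted(traces, key=lambda t: t.quality, reverse=True)
    while len(alive) > 1 and sum(t.kv_tokens for t in alive) > kv_budget_tokens:
        alive.pop()  # the list is sorted best-first, so the last entry is the weakest
    return alive


def majority_vote(traces: List[Trace]) -> Optional[str]:
    """Self-consistency style aggregation: the most common final answer wins."""
    answers = [t.answer for t in traces if t.answer is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None


traces = [
    Trace(0, kv_tokens=100, step_scores=[0.9, 0.8], answer="42"),
    Trace(1, kv_tokens=100, step_scores=[0.2, 0.1], answer="7"),
    Trace(2, kv_tokens=100, step_scores=[0.7, 0.6], answer="42"),
    Trace(3, kv_tokens=100, step_scores=[0.3], answer="7"),
]
kept = prune_if_saturated(traces, kv_budget_tokens=250)
print([t.trace_id for t in kept], majority_vote(kept))  # [0, 2] 42
```

In this toy run the four traces hold 400 tokens against a budget of 250, so the two lowest-scoring traces are dropped and the vote is taken over the remaining two. In a real serving stack the trigger would come from the engine's actual KV cache occupancy rather than a token count.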
