LG Nov 1, 2025

Inference-Time Chain-of-Thought Pruning with Latent Informativeness Signals

Sophie Li, Nicholas Huang, Nayan Saxena, Nina Luo, Vincent Lin, Kevin Zhu, Sunishchal Dev

arXiv:2511.00699v27.11 citations h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses inference efficiency for users deploying LLMs in resource-constrained environments and represents an incremental improvement over existing pruning methods.

The paper tackles the high computational cost of generating multiple candidate solutions with large language models by introducing KAPPA, an inference-time pruning method that reduces peak memory by up to ~60% and total token generation by ~90% relative to Best-of-N (BoN), with minimal impact on accuracy on GSM8K and MATH500.

Large language models (LLMs) improve reasoning accuracy when generating multiple candidate solutions at test time, but standard methods like Best-of-N (BoN) incur high computational cost by fully generating all branches. Self-Truncation Best-of-N (ST-BoN) mitigates this by truncating unpromising paths early, but its reliance on consistency-based heuristics is a limitation as it does not directly evaluate branch quality. We present KL-Adjusted Pruned Path Algorithm (KAPPA), an inference-time method that combines Kullback-Leibler divergence, confidence, and entropy into a principled scoring function to guide progressive pruning. By promoting diversity during exploration and selectively eliminating low-scoring branches, KAPPA maintains accuracy while substantially reducing memory and token usage. Experiments on GSM8K and MATH500 with DeepSeek-R1-Distill-Qwen-1.5B and Qwen2.5-7B-Instruct demonstrate that KAPPA stabilizes performance in smaller models and achieves up to ~60% reduction in peak memory and ~90% reduction in total token generation relative to BoN, with minimal impact on accuracy.
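
The abstract names the three signals behind the scoring function (KL divergence, confidence, entropy) but does not give the formula here. The following is only a minimal sketch of how such a score could drive progressive pruning. It is not the authors' implementation: the linear combination, the weights `w_kl`, `w_conf`, `w_ent`, the use of the mean distribution across live branches as the KL reference, and all function names are assumptions made for illustration.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against zero entries in q."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)


def mean_distribution(dists):
    """Element-wise mean of several distributions over the same vocabulary."""
    n = len(dists)
    return [sum(col) / n for col in zip(*dists)]


def branch_score(p, reference, w_kl=1.0, w_conf=1.0, w_ent=1.0):
    # Assumed form: reward divergence from the reference and high
    # confidence (max probability), penalize high entropy.
    return (w_kl * kl_divergence(p, reference)
            + w_conf * max(p)
            - w_ent * entropy(p))


def prune_step(branches, keep):
    """branches maps a branch id to its next-token distribution.
    Returns the ids of the `keep` highest-scoring branches."""
    reference = mean_distribution(list(branches.values()))
    ranked = sorted(branches,
                    key=lambda b: branch_score(branches[b], reference),
                    reverse=True)
    return ranked[:keep]


def progressive_prune(steps, schedule):
    """steps[t] maps branch id -> distribution at decoding step t;
    schedule[t] is how many branches survive that step."""
    alive = list(steps[0])
    for dists, keep in zip(steps, schedule):
        alive = prune_step({b: dists[b] for b in alive}, keep)
    return alive


# Toy example with a three-token vocabulary and three candidate branches.
toy = {
    "a": [0.90, 0.05, 0.05],  # confident, low entropy
    "b": [0.34, 0.33, 0.33],  # nearly uniform
    "c": [0.60, 0.30, 0.10],
}
survivors = progressive_prune([toy, toy], schedule=[2, 1])
```

In this toy run the nearly uniform branch `b` is dropped first and the confident branch `a` is the final survivor. Only the surviving branch would then be decoded to completion, which is where savings in memory and generated tokens relative to BoN would come from.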
