Value-Guided Search for Efficient Chain-of-Thought Reasoning
This work addresses the computational inefficiency of large language model reasoning and is aimed at AI researchers and practitioners. It is incremental in that it builds on existing process reward models.
The paper tackles inefficient test-time compute scaling in chain-of-thought reasoning by proposing a value-guided search method that reduces inference FLOPs by 30% compared to majority voting while achieving better performance.
In this paper, we propose a simple and efficient method for training value models on long-context reasoning traces. Compared to existing process reward models (PRMs), our method does not require a fine-grained notion of "step," which is difficult to define for long-context reasoning models. By collecting a dataset of 2.5 million reasoning traces, we train a 1.5B-parameter token-level value model and apply it to DeepSeek models for improved performance with test-time compute scaling. We find that block-wise value-guided search (VGS) with a final weighted majority vote achieves better test-time scaling than standard methods such as majority voting or best-of-n. Moreover, VGS significantly reduces the inference FLOPs required to achieve the same performance as majority voting. Our dataset, model, and codebase are open-sourced.
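
To make the idea of block-wise search with a final weighted majority vote concrete, the following is a minimal sketch, not the released implementation. It assumes a beam-style search in which partial traces are extended one block at a time, ranked by a value function, and pruned to a fixed width; finished traces then vote with their values as weights. The names `generate_block`, `value_fn`, `is_finished`, `extract_answer`, `beam_width`, and `max_blocks` are hypothetical placeholders for illustration.

```python
from collections import defaultdict
from typing import Callable, List, Sequence, Tuple


def weighted_majority_vote(answers: Sequence[str], weights: Sequence[float]) -> str:
    """Return the answer whose supporting responses have the largest total weight."""
    totals = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer] += weight
    return max(totals, key=totals.get)


def value_guided_search(
    prompt: str,
    generate_block: Callable[[str], List[str]],  # hypothetical: prefix -> candidate next blocks
    value_fn: Callable[[str], float],            # hypothetical: prefix -> scalar value estimate
    is_finished: Callable[[str], bool],          # hypothetical: has the trace produced an answer?
    extract_answer: Callable[[str], str],        # hypothetical: finished trace -> final answer
    beam_width: int = 2,
    max_blocks: int = 8,
) -> str:
    """Illustrative block-wise search: expand, score with the value model, prune, vote."""
    beams = [prompt]
    finished: List[Tuple[str, float]] = []

    for _ in range(max_blocks):
        # Extend every surviving prefix by one block (no per-step segmentation needed).
        candidates = [prefix + block for prefix in beams for block in generate_block(prefix)]
        if not candidates:
            break

        # Keep only the highest-valued partial traces.
        candidates.sort(key=value_fn, reverse=True)
        beams = []
        for candidate in candidates[:beam_width]:
            if is_finished(candidate):
                finished.append((candidate, value_fn(candidate)))
            else:
                beams.append(candidate)
        if not beams:
            break

    if not finished:
        raise ValueError("no trace finished within max_blocks")

    # Final aggregation: each finished trace votes with its value as the weight.
    answers = [extract_answer(trace) for trace, _ in finished]
    weights = [value for _, value in finished]
    return weighted_majority_vote(answers, weights)
```

In practice `generate_block` would sample a fixed number of tokens from the generator model and `value_fn` would query the trained value model; both are left abstract here because their interfaces are not specified in this abstract.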