More Bang for the Buck: Process Reward Modeling with Entropy-Driven Uncertainty
This work addresses the annotation-efficiency problem in mathematical reasoning tasks, offering a scalable paradigm for process supervision.
The paper tackles the cost of manual step annotation in process reward modeling by introducing EDU-PRM, an entropy-driven framework that segments reasoning steps dynamically at tokens with high predictive entropy. On ProcessBench, EDU-PRM achieves results comparable to state-of-the-art models with only 1.5% of the training data, and the accompanying EDU sampling strategy raises accuracy on generative reasoning tasks from 64.7% to 67.3% while reducing token usage by 32%.
We introduce the Entropy-Driven Uncertainty Process Reward Model (EDU-PRM), a novel entropy-driven training framework for process reward modeling that enables dynamic, uncertainty-aligned segmentation of complex reasoning steps, eliminating the need for costly manual step annotations. Unlike previous Process Reward Models (PRMs) that rely on static partitioning and human labeling, EDU-PRM automatically anchors step boundaries at tokens with high predictive entropy, effectively capturing intrinsic logical transitions and facilitating efficient exploration of diverse reasoning paths. On the ProcessBench benchmark, EDU-PRM outperforms strong public PRM baselines, such as Math-Shepherd PRM and Omega PRM, and achieves results comparable to SOTA models while using only 1.5% of the training data. Furthermore, with our proposed EDU sampling strategy, accuracy on generative reasoning tasks increases from 64.7% to 67.3%, accompanied by a 32% reduction in token usage. These findings underscore the potential of EDU-PRM as a scalable and annotation-efficient paradigm for process supervision in mathematical reasoning, paving the way for more efficient and robust approaches to complex mathematical problem solving.
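To make the idea of anchoring step boundaries at high-entropy tokens concrete, the following is a minimal sketch, not the authors' implementation. It assumes that per-token logits are available, that predictive entropy is the Shannon entropy of the softmax distribution, and that a boundary is placed wherever that entropy exceeds a fixed threshold. The function names `token_entropy` and `segment_by_entropy`, the threshold value, and the toy logits are all hypothetical; the abstract does not specify how "high" entropy is determined.

```python
import math
from typing import List, Sequence


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the softmax distribution over `logits`."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def segment_by_entropy(
    tokens: Sequence[str],
    logits_per_token: Sequence[Sequence[float]],
    threshold: float,  # assumed fixed cutoff; a hypothetical choice
) -> List[List[str]]:
    """Start a new step at every token whose predictive entropy exceeds `threshold`."""
    steps: List[List[str]] = []
    current: List[str] = []
    for tok, logits in zip(tokens, logits_per_token):
        # A high-entropy token closes the running step and opens a new one.
        if current and token_entropy(logits) > threshold:
            steps.append(current)
            current = []
        current.append(tok)
    if current:
        steps.append(current)
    return steps


# Toy example: the model is confident everywhere except at "so".
peaked = [10.0, 0.0, 0.0, 0.0]  # low entropy
flat = [0.0, 0.0, 0.0, 0.0]     # maximal entropy over 4 options, ln(4)
tokens = ["x", "=", "2", "so", "y", "=", "4"]
logits = [peaked, peaked, peaked, flat, peaked, peaked, peaked]
steps = segment_by_entropy(tokens, logits, threshold=1.0)
```

In this toy input the only boundary falls at "so", giving two steps. In practice the logits would come from the language model's forward pass, and the cutoff would need tuning for the model and task.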