LG · AI · CL · Mar 28, 2025

More Bang for the Buck: Process Reward Modeling with Entropy-Driven Uncertainty

arXiv:2503.22233v3 · 6 citations · h-index: 4
Originality: Highly original
AI Analysis

This work addresses the annotation efficiency problem for mathematical reasoning tasks, offering a scalable paradigm for process supervision.

The paper tackles the cost of manual step annotation in process reward modeling. It introduces EDU-PRM, an entropy-driven framework that segments reasoning steps dynamically at tokens with high predictive entropy. EDU-PRM achieves results comparable to state-of-the-art models using only 1.5% of the training data, and its EDU sampling strategy raises accuracy on generative reasoning tasks from 64.7% to 67.3% while reducing token usage by 32%.

We introduce the Entropy-Driven Uncertainty Process Reward Model (EDU-PRM), a novel entropy-driven training framework for process reward modeling that enables dynamic, uncertainty-aligned segmentation of complex reasoning steps, eliminating the need for costly manual step annotations. Unlike previous Process Reward Models (PRMs) that rely on static partitioning and human labeling, EDU-PRM automatically anchors step boundaries at tokens with high predictive entropy, effectively capturing intrinsic logical transitions and facilitating efficient exploration of diverse reasoning paths. On the ProcessBench benchmark, EDU-PRM outperforms strong public PRM baselines, such as Math-Shepherd PRM and Omega PRM, and achieves results comparable to SOTA models while using only 1.5% of the training data. Furthermore, by leveraging our proposed EDU sampling strategy, we observe an accuracy boost from 64.7% to 67.3% on generative reasoning tasks, accompanied by a 32% reduction in token usage. These findings underscore the potential of EDU-PRM as a scalable and annotation-efficient paradigm for process supervision in mathematical reasoning, paving the way for more efficient and robust approaches to complex mathematical problem solving.
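
The idea of anchoring step boundaries at high-entropy tokens can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`token_entropy`, `entropy_boundaries`, `split_at`), the fixed threshold of 1.0 nats, and the toy 4-token distributions are all assumptions made for illustration. The abstract does not state how the boundary criterion is set, and the EDU sampling strategy itself is not shown here.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_boundaries(
    step_probs: Sequence[Sequence[float]], threshold: float
) -> List[int]:
    """Positions whose predictive entropy exceeds `threshold`.

    The fixed threshold is an assumption of this sketch.
    """
    return [
        i for i, probs in enumerate(step_probs)
        if token_entropy(probs) > threshold
    ]


def split_at(tokens: Sequence[str], boundaries: Sequence[int]) -> List[List[str]]:
    """Start a new segment at every boundary position."""
    segments, start = [], 0
    for b in boundaries:
        if b > start:
            segments.append(list(tokens[start:b]))
            start = b
    segments.append(list(tokens[start:]))
    return segments


# Toy data: hypothetical next-token distributions over a 4-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]   # low entropy
uncertain = [0.25, 0.25, 0.25, 0.25]   # maximum entropy, ln(4) nats

tokens = ["x", "=", "2", "so", "y", "=", "4"]
dists = [confident, confident, confident, uncertain,
         confident, confident, confident]

boundaries = entropy_boundaries(dists, threshold=1.0)
segments = split_at(tokens, boundaries)
```

In this toy run only the position of `"so"` exceeds the threshold, so the sequence is split into two segments there. In practice the distributions would come from a language model's softmax output rather than hand-written lists.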
