LG · AI · Apr 14, 2025

Efficient Process Reward Model Training via Active Learning

arXiv:2504.10559v1 · 17 citations · h-index: 15
Originality: Incremental advance
AI Analysis

This work addresses the problem of high labeling costs for PRMs in AI, offering an incremental improvement in efficiency and performance for training LLMs with step-level supervision.

The paper tackles the challenge of scaling up training data annotation for Process Reward Models (PRMs) by proposing ActPRM, an active learning approach that reduces annotation costs by 50% while achieving comparable or better performance, and further uses it to filter data, yielding a new state-of-the-art PRM with scores of 75.0% on ProcessBench and 65.5% on PRMBench.

Process Reward Models (PRMs) provide step-level supervision to large language models (LLMs), but scaling up training data annotation remains challenging for both humans and LLMs. To address this limitation, we propose an active learning approach, ActPRM, which proactively selects the most uncertain samples for training, substantially reducing labeling costs. During training, we use the PRM to estimate uncertainty after the forward pass and retain only highly uncertain data. A capable yet costly reasoning model then labels this data, after which we compute the loss with respect to the labels and update the PRM's weights. We compare ActPRM with vanilla fine-tuning in a pool-based active learning setting and show that ActPRM reduces annotation by 50% while achieving comparable or even better performance. Beyond annotation efficiency, we further advance the actively trained PRM by using ActPRM to filter over 1M math reasoning trajectories, retaining 60% of the data. Subsequent training on this selected dataset yields a new state-of-the-art (SOTA) PRM on ProcessBench (75.0%) and PRMBench (65.5%) compared with same-sized models.
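
The selection step described in the abstract (forward pass, uncertainty estimate, keep only uncertain data, then label) can be illustrated with a minimal sketch. The abstract does not specify how uncertainty is computed, so everything below is an assumption for illustration: uncertainty is taken to be the binary entropy of the PRM's per-step correctness probability, a trajectory is scored by its most uncertain step, and the names `binary_entropy`, `select_uncertain`, `active_step`, `label_fn`, and the `threshold` value are hypothetical, not taken from the paper or its code.

```python
import math
from typing import Callable, List, Sequence, Tuple


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) prediction: 1.0 at p = 0.5, 0.0 at p = 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def trajectory_uncertainty(step_probs: Sequence[float]) -> float:
    """Assumed scoring rule: a trajectory is as uncertain as its most uncertain step."""
    return max((binary_entropy(p) for p in step_probs), default=0.0)


def select_uncertain(
    batch_probs: Sequence[Sequence[float]], threshold: float = 0.8
) -> List[int]:
    """Return indices of trajectories whose uncertainty reaches the (hypothetical) threshold."""
    return [
        i
        for i, probs in enumerate(batch_probs)
        if trajectory_uncertainty(probs) >= threshold
    ]


def active_step(
    batch_probs: Sequence[Sequence[float]],
    label_fn: Callable[[int], List[int]],
    threshold: float = 0.8,
) -> List[Tuple[int, List[int]]]:
    """Query the costly labeler only for retained trajectories.

    `label_fn` stands in for the reasoning model that annotates step correctness;
    the returned (index, labels) pairs are what a training loop would compute the
    loss on before updating the PRM's weights.
    """
    return [(i, label_fn(i)) for i in select_uncertain(batch_probs, threshold)]


if __name__ == "__main__":
    # Per-step correctness probabilities from a PRM forward pass (made-up numbers).
    probs = [[0.99, 0.98], [0.95, 0.55], [0.02, 0.50]]
    labeled = active_step(probs, label_fn=lambda i: [1] * len(probs[i]))
    print([i for i, _ in labeled])
```

In this sketch the confident first trajectory is never sent to the labeler, which is where the annotation savings would come from; the actual savings depend on the uncertainty estimator and threshold, neither of which the abstract states.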

Code Implementations: 1 repo