LG AI | Apr 14, 2025

Efficient Process Reward Model Training via Active Learning

Keyu Duan, Zichen Liu, Xin Mao, Tianyu Pang, Changyu Chen, Qiguang Chen, Michael Qizhe Shieh, Longxu Dou

arXiv:2504.10559v123.317 citations | h-index: 15 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the problem of high labeling costs for PRMs in AI, offering an incremental improvement in efficiency and performance for training LLMs with step-level supervision.

The paper tackles the challenge of scaling up training data annotation for Process Reward Models (PRMs) by proposing ActPRM, an active learning approach that reduces annotation costs by 50% while achieving comparable or better performance, and further uses it to filter data, yielding a new state-of-the-art PRM with scores of 75.0% on ProcessBench and 65.5% on PRMBench.

Process Reward Models (PRMs) provide step-level supervision to large language models (LLMs), but scaling up training data annotation remains challenging for both humans and LLMs. To address this limitation, we propose an active learning approach, ActPRM, which proactively selects the most uncertain samples for training, substantially reducing labeling costs. During training, we use the PRM to estimate uncertainty after the forward pass and retain only highly uncertain data. A capable yet costly reasoning model then labels this data, after which we compute the loss with respect to the labels and update the PRM's weights. We compare ActPRM with vanilla fine-tuning in a pool-based active learning setting, demonstrating that ActPRM reduces annotation by 50% while achieving comparable or even better performance. Beyond annotation efficiency, we further advance the actively trained PRM by filtering over 1M math reasoning trajectories with ActPRM, retaining 60% of the data. Subsequent training on this selected dataset yields a new state-of-the-art (SOTA) PRM on ProcessBench (75.0%) and PRMBench (65.5%) compared with same-sized models.
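The training loop described in the abstract (forward pass, uncertainty estimate, keep only uncertain samples, query the costly labeler, compute the loss, update the weights) can be illustrated with a toy sketch. This is not the authors' implementation: the logistic model standing in for the PRM, the confidence threshold used as the uncertainty criterion, and the names `predict`, `is_uncertain`, `costly_labeler`, and `train_actively` are all assumptions made for illustration.

```python
import math
import random


def predict(weights, features):
    """Toy stand-in for the PRM forward pass: probability that a step is correct."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def is_uncertain(p, threshold=0.9):
    """Assumed criterion: uncertain when the top class probability is below threshold."""
    return max(p, 1.0 - p) < threshold


def costly_labeler(features):
    """Hypothetical oracle standing in for the expensive reasoning model."""
    return 1.0 if features[0] + features[1] > 0 else 0.0


def train_actively(pool, lr=0.5, threshold=0.9):
    """One pass over the pool, buying labels only for uncertain samples."""
    weights = [0.0, 0.0]
    labels_requested = 0
    for features in pool:
        p = predict(weights, features)          # forward pass
        if not is_uncertain(p, threshold):      # confident: skip, no label bought
            continue
        y = costly_labeler(features)            # label only the uncertain sample
        labels_requested += 1
        # Gradient of binary cross-entropy for a logistic model is (p - y) * x.
        weights = [w - lr * (p - y) * x for w, x in zip(weights, features)]
    return weights, labels_requested


random.seed(0)
pool = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(2000)]
weights, labels_requested = train_actively(pool)

# Evaluate against the oracle on the same pool (for illustration only).
accuracy = sum(
    (predict(weights, x) > 0.5) == (costly_labeler(x) == 1.0) for x in pool
) / len(pool)
print(f"labels requested: {labels_requested} of {len(pool)}")
print(f"pool accuracy: {accuracy:.3f}")
```

In this sketch the labeler is called only when the model is unsure, so the number of labels requested is what the annotation budget would pay for. The threshold value and the uncertainty measure are placeholders; the abstract does not specify which measure ActPRM uses.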
