LG, Aug 5, 2025

VRPRM: Process Reward Modeling via Visual Reasoning

arXiv:2508.03556v2, 6 citations, h-index: 5
Originality: Highly original
AI Analysis

This paper addresses the high cost of annotating data for PRMs used in LLM post-training and offers a more data-efficient approach, though it appears incremental because it builds on existing PRM and CoT concepts.

The paper tackles the high annotation cost and limited reasoning capabilities of Process Reward Models (PRMs) by proposing VRPRM, a PRM based on visual reasoning and trained with a two-stage strategy. Using only 3.6K CoT-PRM SFT data and 50K non-CoT PRM RL training data, VRPRM surpasses a non-thinking PRM trained with a total data volume of 400K and achieves up to 118% relative performance improvement over the base model in the BoN experiment.

Process Reward Models (PRMs) are widely used in the post-training of Large Language Models (LLMs) because they can perform fine-grained evaluation of the reasoning steps of generated content. However, most PRMs lack long-term reasoning and deep thinking capabilities. Although a few works have tried to introduce Chain-of-Thought (CoT) capability into PRMs, the annotation cost of CoT-PRM data is too high for such models to play a stable role across tasks. To address these challenges, we propose VRPRM, a process reward model via visual reasoning, and design an efficient two-stage training strategy. Experimental results show that, using only 3.6K CoT-PRM SFT data and 50K non-CoT PRM RL training data, VRPRM surpasses a non-thinking PRM trained with a total data volume of 400K and achieves a relative performance improvement of up to 118% over the base model in the Best-of-N (BoN) experiment. This result confirms that the proposed combined training strategy can achieve higher-quality reasoning capabilities at a lower data annotation cost, thus providing a new paradigm for PRM training with more efficient data utilization.
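
The abstract's BoN result and data-budget comparison can be made concrete with a small sketch. The code below is illustrative only and is not taken from the paper: the helper names (`aggregate_min`, `best_of_n`, `relative_improvement`), the choice of the minimum step score as the aggregation rule, and the per-step scores are all assumptions made for this example. Only the data volumes (3.6K, 50K, 400K) come from the abstract.

```python
from typing import Callable, Sequence


def aggregate_min(step_scores: Sequence[float]) -> float:
    """Score a whole solution by its weakest step (an assumed aggregation rule)."""
    return min(step_scores)


def best_of_n(
    candidates: Sequence[Sequence[float]],
    aggregate: Callable[[Sequence[float]], float] = aggregate_min,
) -> int:
    """Return the index of the candidate with the highest aggregated step score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(range(len(candidates)), key=lambda i: aggregate(candidates[i]))


def relative_improvement(base: float, new: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


# Hypothetical per-step PRM scores for three sampled solutions (not real data).
candidates = [
    [0.90, 0.20, 0.80],
    [0.70, 0.60, 0.70],
    [0.95, 0.90, 0.10],
]
chosen = best_of_n(candidates)

# Data volumes quoted in the abstract: 3.6K SFT + 50K RL versus 400K.
vrprm_data = 3_600 + 50_000
baseline_data = 400_000
budget_fraction = vrprm_data / baseline_data
```

Under these assumptions, `best_of_n` picks the second candidate because its weakest step scores highest, and `budget_fraction` shows that the combined VRPRM training data is roughly 13% of the 400K baseline volume. A relative improvement of 118% means the new score is 2.18 times the base score; for example, a hypothetical base score of 25.0 rising to 54.5 gives `relative_improvement(25.0, 54.5)` of about 118.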

