Controllable and Verifiable Process Data Synthesis for Process Reward Models
For researchers working on process reward models and reasoning tasks, this work provides a method for generating process supervision data in which the location and type of each injected error are controlled and verified, addressing the need for fine-grained process supervision.
The paper proposes a controllable and verifiable framework for synthesizing process supervision data for process reward models (PRMs), in which a template-aware error is injected at a chosen intermediate step of a correct reasoning chain. Experiments show that the synthesized data improve Best-of-8 reranking on logical reasoning benchmarks and transfer to mathematical reasoning.
Process reward models (PRMs) rely on high-quality process supervision data, yet existing construction methods often provide limited control over error location, error type, and trajectory consistency. We propose a controllable and verifiable framework for synthesizing process supervision data for PRMs. Our framework first constructs a correct symbolic reasoning chain, injects a template-aware error into an intermediate step, recomputes subsequent steps under the corrupted state, and verifies that the injected step is not derivable from its prefix. The resulting paired trajectories are prefix-invalid at the first error while remaining trajectory-consistent after symbolic recomputation, and are translated into aligned natural-language processes for PRM training and evaluation. Experiments show that the synthesized data improve Best-of-8 reranking on logical reasoning benchmarks and transfer to mathematical reasoning. Step-level evaluation further shows that first-error localization remains substantially more challenging than overall step classification, highlighting the need for fine-grained and verifiable process supervision.
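The four stages named in the abstract (construct a correct chain, inject an error, recompute the suffix, verify the injected step) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy domain in which each step applies one integer operation to a running value, and every name in it (`Step`, `build_chain`, `inject_error`, `verify_pair`, `first_error_index`) is hypothetical. The additive `offset` corruption stands in for the template-aware error injection, whose details the abstract does not give.

```python
import operator
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Toy operator set; an assumption made for illustration only.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


@dataclass(frozen=True)
class Step:
    op: str      # operator applied at this step
    arg: int     # operand
    before: int  # state the step starts from
    after: int   # state the step claims to reach


def derivable(step: Step) -> bool:
    """A step is derivable if its claimed result follows from its input state."""
    return OPS[step.op](step.before, step.arg) == step.after


def build_chain(start: int, program: List[Tuple[str, int]]) -> List[Step]:
    """Stage 1: construct a correct symbolic chain by executing the program."""
    steps, state = [], start
    for op, arg in program:
        nxt = OPS[op](state, arg)
        steps.append(Step(op, arg, state, nxt))
        state = nxt
    return steps


def inject_error(chain: List[Step], index: int, offset: int = 1) -> List[Step]:
    """Stages 2 and 3: corrupt the result of step `index`, then recompute
    every later step from the corrupted state."""
    if offset == 0:
        raise ValueError("offset must be non-zero to produce an error")
    target = chain[index]
    bad = Step(target.op, target.arg, target.before, target.after + offset)
    out, state = list(chain[:index]) + [bad], bad.after
    for s in chain[index + 1:]:
        nxt = OPS[s.op](state, s.arg)
        out.append(Step(s.op, s.arg, state, nxt))
        state = nxt
    return out


def verify_pair(correct: List[Step], corrupted: List[Step], index: int) -> bool:
    """Stage 4: check that the prefix is shared, the injected step is not
    derivable from it, and the suffix is consistent with the corrupted state."""
    same_prefix = corrupted[:index] == correct[:index]
    invalid_at_index = not derivable(corrupted[index])
    linked = all(corrupted[i].before == corrupted[i - 1].after
                 for i in range(1, len(corrupted)))
    suffix_ok = all(derivable(s) for s in corrupted[index + 1:])
    return same_prefix and invalid_at_index and linked and suffix_ok


def first_error_index(chain: List[Step]) -> Optional[int]:
    """Index of the first non-derivable step, or None if the chain is valid."""
    for i, s in enumerate(chain):
        if not derivable(s):
            return i
    return None


correct = build_chain(2, [("+", 3), ("*", 2), ("-", 4)])
corrupted = inject_error(correct, index=1)
print([s.after for s in correct], [s.after for s in corrupted])
print(verify_pair(correct, corrupted, index=1), first_error_index(corrupted))
```

In this sketch only the injected step fails the `derivable` check; the steps after it are locally valid given the corrupted state, which mirrors the abstract's description of trajectories that are prefix-invalid at the first error yet trajectory-consistent afterwards. Because the error index is known by construction, it can serve as ground truth when evaluating first-error localization.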