CL · AI · LG · Mar 9

DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning

Chi-Min Chan, Ehsan Hajiramezanali, Xiner Li, Edward De Brouwer, Carl Edwards, Wei Xue, Sirui Han, Yike Guo, Gabriele Scalia

arXiv:2603.08095v1 · 9.4 · h-index: 14

Predicted impact: top 57% in CL · last 90 days · Originality: Incremental advance

AI Analysis

This work addresses the efficient training of reliable Process Reward Models (PRMs) for scientific reasoning tasks, where the veracity of the reasoning process matters as much as the final outcome. It is most relevant to researchers and developers working on AI for science. The contribution is an incremental improvement in weak-to-strong generalization.

The paper tackles the challenge of training Process Reward Models (PRMs) from noisy weak supervision, a setting made difficult by the lack of prescriptive guidelines for selecting high-quality training signals. The authors introduce the Dual-Consensus Weak-to-Strong (DC-W2S) framework, which stratifies supervision signals using Self-Consensus and Neighborhood-Consensus metrics, then applies a curriculum of balanced sampling and reliability-aware masking to train robust PRMs without exhaustive expert annotation.

In scientific reasoning tasks, the veracity of the reasoning process is as critical as the final outcome. While Process Reward Models (PRMs) offer a solution to the coarse-grained supervision problems inherent in Outcome Reward Models (ORMs), their deployment is hindered by the prohibitive cost of obtaining expert-verified step-wise labels. This paper addresses the challenge of training reliable PRMs using abundant but noisy "weak" supervision. We argue that existing Weak-to-Strong Generalization (W2SG) theories lack prescriptive guidelines for selecting high-quality training signals from noisy data. To bridge this gap, we introduce the Dual-Consensus Weak-to-Strong (DC-W2S) framework. By intersecting Self-Consensus (SC) metrics among weak supervisors with Neighborhood-Consensus (NC) metrics in the embedding space, we stratify supervision signals into distinct reliability regimes. We then employ a curriculum of instance-level balanced sampling and label-level reliability-aware masking to guide the training process. We demonstrate that DC-W2S enables the training of robust PRMs for complex reasoning without exhaustive expert annotation, proving that strategic data curation is more effective than indiscriminate training on large-scale noisy datasets.
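
The abstract does not define the SC and NC metrics, so the following is only a minimal sketch of how such a dual-consensus stratification could look, not the authors' implementation. It assumes SC is the fraction of weak supervisors agreeing with the majority step label, NC is the fraction of nearest neighbours in embedding space that share that label, and that the lowest-reliability regime is simply masked out of the loss. All function names, thresholds, regime names, and the toy data are hypothetical, and the balanced-sampling curriculum is not modelled.

```python
import math
from collections import Counter


def majority_label(votes):
    """Most common label among weak supervisors (ties go to the first label seen)."""
    return Counter(votes).most_common(1)[0][0]


def self_consensus(votes):
    """Assumed SC: fraction of weak supervisors agreeing with the majority label."""
    label = majority_label(votes)
    return sum(v == label for v in votes) / len(votes)


def neighborhood_consensus(index, embeddings, labels, k=1):
    """Assumed NC: fraction of the k nearest neighbours (Euclidean) sharing this label."""
    dists = sorted(
        (math.dist(embeddings[index], e), j)
        for j, e in enumerate(embeddings)
        if j != index
    )
    neighbours = [j for _, j in dists[:k]]
    return sum(labels[j] == labels[index] for j in neighbours) / len(neighbours)


def stratify(votes_per_step, embeddings, sc_thresh=0.75, nc_thresh=0.5, k=1):
    """Intersect the two consensus signals into hypothetical reliability regimes."""
    labels = [majority_label(v) for v in votes_per_step]
    regimes = []
    for i, votes in enumerate(votes_per_step):
        sc_high = self_consensus(votes) >= sc_thresh
        nc_high = neighborhood_consensus(i, embeddings, labels, k) >= nc_thresh
        if sc_high and nc_high:
            regimes.append("reliable")
        elif sc_high or nc_high:
            regimes.append("ambiguous")
        else:
            regimes.append("unreliable")
    return labels, regimes


def loss_mask(regimes):
    """Assumed masking rule: drop only the 'unreliable' labels from the loss."""
    return [0.0 if r == "unreliable" else 1.0 for r in regimes]


# Toy data: four reasoning steps, four weak supervisors each, 2-D embeddings.
votes = [[1, 1, 1, 1], [1, 1, 1, 0], [0, 0, 0, 0], [1, 0, 1, 0]]
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]

labels, regimes = stratify(votes, embeddings)
mask = loss_mask(regimes)
print(list(zip(labels, regimes, mask)))
```

In this toy run the third step has unanimous votes but a disagreeing neighbour, so it falls into the intermediate regime, while the fourth step fails both checks and is masked. A real pipeline would replace the Euclidean lookup with whatever embedding model and neighbour search the paper specifies.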
