LG · AI · May 7

Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport

Rachel Ma, Dylan Hadfield-Menell, Kristjan Greenewald

arXiv:2605.06785 · 78.3

Predicted impact: top 17% in LG · last 90 days
Originality: Incremental advance

AI Analysis

For researchers using PRMs in mathematical reasoning, this work provides a principled calibration method with structural guarantees, though it is incremental as it adapts existing conditional OT techniques.

The paper introduces a novel method using conditional optimal transport to calibrate Process Reward Models (PRMs), improving calibration and confidence estimation for inference-time scaling. On MATH-500 and AIME benchmarks, the method substantially improves calibration over uncalibrated PRMs and quantile regression, and generally improves downstream Best-of-N performance.

Inference-time scaling methods rely on Process Reward Models (PRMs), which are often poorly calibrated and overestimate success probabilities. We propose, to our knowledge, the first use of conditional optimal transport for calibrating PRMs, modifying conditional OT (CondOT) map learning (Bunne et al., 2022) to estimate a monotonic conditional quantile function over success probabilities estimated by the PRM, conditioned on PRM hidden states. This yields structurally valid quantile estimates and enables efficient extraction of confidence bounds at arbitrary levels, which we integrate into the instance-adaptive scaling (IAS) framework of Park et al. (2025). We evaluate on mathematical reasoning benchmarks spanning moderate-difficulty problems (MATH-500) and harder out-of-distribution problems (AIME). For PRMs with reliable ranking signals, our method substantially improves calibration over both uncalibrated PRMs and quantile regression. On downstream Best-of-N IAS performance, our method generally improves over uncalibrated PRMs. These results establish conditional optimal transport as another principled and practical approach to PRM calibration, offering structural guarantees and flexible uncertainty estimation.
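To make "structurally valid quantile estimates" and "confidence bounds at arbitrary levels" concrete, here is a minimal Python sketch. It is not the CondOT map from the paper: it assumes a simpler monotone-by-construction parameterization (cumulative sums of positive increments) purely for illustration. The function names, the interpolation step, and the sample-budget rule (draw enough candidates that at least one succeeds with a target probability, given a lower bound on per-sample success) are all hypothetical stand-ins for how such bounds might feed an instance-adaptive Best-of-N budget.

```python
import math
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)); strictly positive for finite x.
    return np.logaddexp(0.0, x)

def monotone_quantiles(raw):
    """Map K + 1 unconstrained scores to K increasing quantile values in (0, 1).

    Hypothetical parameterization (an assumption, not the paper's method):
    cumulative sums of positive steps are sorted by construction, so the
    estimated quantiles can never cross.
    """
    steps = softplus(np.asarray(raw, dtype=float))
    cum = np.cumsum(steps)
    return cum[:-1] / cum[-1]

def quantile_at(levels, values, tau):
    # Linear interpolation gives a bound at any level tau inside the grid;
    # np.interp clamps to the end values outside the grid.
    return float(np.interp(tau, levels, values))

def samples_needed(p_lower, target=0.95, n_max=64):
    """Smallest N with 1 - (1 - p_lower)**N >= target, capped at n_max.

    Assumes independent samples that each succeed with probability
    at least p_lower (an illustrative budget rule).
    """
    if p_lower <= 0.0:
        return n_max
    if p_lower >= 1.0:
        return 1
    n = math.ceil(math.log(1.0 - target) / math.log(1.0 - p_lower))
    return max(1, min(n, n_max))

# Example: 9 quantile levels, random scores standing in for a model head
# that would, in practice, be conditioned on PRM hidden states.
levels = np.linspace(0.1, 0.9, 9)
raw = np.random.default_rng(0).normal(size=levels.size + 1)
values = monotone_quantiles(raw)
lower_bound = quantile_at(levels, values, 0.25)  # conservative success estimate
budget = samples_needed(lower_bound)
```

A plain quantile-regression head predicts each level independently and so can emit crossing quantiles; the cumulative-sum construction rules that out by design, which is the kind of structural guarantee the abstract refers to.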
