CL · Feb 16, 2025

Uncertainty-Aware Step-wise Verification with Generative Reward Models

arXiv:2502.11250v1 · 17 citations · h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses reliability issues in process supervision for large language models, offering an incremental improvement for mathematical reasoning tasks.

The paper tackles unreliable step-wise verification in mathematical reasoning by introducing CoT Entropy, a novel uncertainty quantification method for generative reward models that makes their verification more robust.

Complex multi-step reasoning tasks, such as solving mathematical problems, remain challenging for large language models (LLMs). While outcome supervision is commonly used, process supervision via process reward models (PRMs) provides intermediate rewards to verify step-wise correctness in solution traces. However, as proxies for human judgement, PRMs suffer from reliability issues, including susceptibility to reward hacking. In this work, we propose leveraging uncertainty quantification (UQ) to enhance the reliability of step-wise verification with generative reward models for mathematical reasoning tasks. We introduce CoT Entropy, a novel UQ method that outperforms existing approaches in quantifying a PRM's uncertainty in step-wise verification. Our results demonstrate that incorporating uncertainty estimates improves the robustness of judge-LM PRMs, leading to more reliable verification.
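
The abstract does not spell out how CoT Entropy is computed, so the following is only a minimal sketch of the general idea of entropy-based uncertainty for step-wise verification, not the paper's method. It assumes a judge LM has been sampled several times on one solution step and that each sampled rationale ends in a discrete verdict. The function names, the verdict labels, and the `max_entropy` threshold are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def verdict_entropy(verdicts):
    """Shannon entropy (in nats) of the empirical distribution over sampled verdicts."""
    if not verdicts:
        raise ValueError("need at least one sampled verdict")
    counts = Counter(verdicts)
    total = len(verdicts)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def verify_step(verdicts, max_entropy=0.5):
    """Return (majority verdict, entropy, abstain flag) for one solution step.

    The step is flagged for abstention when the sampled verdicts disagree
    enough that their entropy exceeds max_entropy (an assumed threshold).
    """
    entropy = verdict_entropy(verdicts)
    majority = Counter(verdicts).most_common(1)[0][0]
    return majority, entropy, entropy > max_entropy


# Hypothetical verdicts from 8 sampled judge rationales for a single step.
samples = ["correct"] * 7 + ["incorrect"]
print(verify_step(samples))
```

Low entropy means the sampled judgments agree and the verdict can be trusted more. High entropy marks a step where the verifier is uncertain, which could be deferred to a human or excluded from the reward signal.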