CL AI Oct 27, 2025

Process Reward Models for Sentence-Level Verification of LVLM Radiology Reports

Alois Thomas, Maya Varma, Jean-Benoit Delbrouck, Curtis P. Langlotz

arXiv:2510.23217v1 h-index: 21

Originality: Incremental advance

AI Analysis

This work addresses safety risks in clinical LVLMs by providing a model-agnostic verification method, though the advance is incremental because it builds on existing reward modeling techniques.

The paper tackles hallucinations in radiology reports generated by Large Vision-Language Models (LVLMs) by introducing a sentence-level Process Reward Model (PRM) for verification. The PRM yields a relative improvement of 7.5% in Matthews Correlation Coefficient over white-box baselines and improves F1-CheXbert by up to 7.4% when used to guide report selection.

Automating radiology report generation with Large Vision-Language Models (LVLMs) holds great potential, yet these models often produce clinically critical hallucinations, posing serious risks. Existing hallucination detection methods frequently lack the necessary sentence-level granularity or robust generalization across different LVLM generators. We introduce a novel approach: a sentence-level Process Reward Model (PRM) adapted for this vision-language task. Our PRM predicts the factual correctness of each generated sentence, conditioned on clinical context and preceding text. When fine-tuned on MIMIC-CXR with weakly-supervised labels, a lightweight 0.5B-parameter PRM outperforms existing verification techniques, demonstrating, for instance, relative improvements of 7.5% in Matthews Correlation Coefficient and 1.8% in AUROC over strong white-box baselines on outputs from one LVLM. Unlike methods reliant on internal model states, our PRM demonstrates strong generalization to an unseen LVLM. We further show its practical utility: PRM scores effectively filter low-quality reports, improving F1-CheXbert scores by 4.5% (when discarding the worst 10% of reports). Moreover, when guiding a novel weighted best-of-N selection process on the MIMIC-CXR test set, our PRM shows relative improvements in clinical metrics of 7.4% for F1-CheXbert and 0.6% for BERTScore. These results demonstrate that a lightweight, context-aware PRM provides a model-agnostic safety layer for clinical LVLMs without access to internal activations.
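
The report-filtering step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how sentence-level scores are combined into a report-level score, so the arithmetic mean used here is an assumption, and the names `report_score` and `filter_reports` are hypothetical.

```python
from statistics import mean


def report_score(sentence_scores):
    """Collapse per-sentence correctness scores into one report score.

    ASSUMPTION: arithmetic mean. The abstract does not specify the
    aggregation; min or product would be equally plausible choices.
    """
    if not sentence_scores:
        raise ValueError("a report needs at least one scored sentence")
    return mean(sentence_scores)


def filter_reports(reports, discard_fraction=0.10):
    """Drop the lowest-scoring fraction of reports.

    reports: dict mapping a report id to its list of sentence scores
             (hypothetical PRM outputs in [0, 1]).
    Returns the ids that are kept, ordered from highest to lowest score.
    """
    if not 0.0 <= discard_fraction < 1.0:
        raise ValueError("discard_fraction must be in [0, 1)")
    ranked = sorted(reports, key=lambda rid: report_score(reports[rid]), reverse=True)
    n_discard = int(len(ranked) * discard_fraction)  # rounds down
    return ranked[: len(ranked) - n_discard]
```

With ten scored reports and `discard_fraction=0.10`, the single report with the lowest mean sentence score is removed and the other nine are returned.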
