AI, CL | Apr 3

VERT: Reliable LLM Judges for Radiology Report Evaluation

Federica Bologna, Jean-Philippe Corbeil, Matthew Wilkens, Asma Ben Abacha

arXiv:2604.03376 | 43.9 | h-index: 15

AI Analysis

For researchers and practitioners in medical AI, this work provides a reliable and efficient LLM-based evaluation method for radiology reports across multiple modalities and anatomies.

The paper proposes VERT, an LLM-based metric for radiology report evaluation, and shows it improves correlation with radiologist judgments by up to 11.7% over GREEN. Fine-tuning Qwen3 30B with 1,300 samples yields up to 25% improvement and 37.2x faster inference.

Current literature on radiology report evaluation has focused primarily on designing LLM-based metrics and fine-tuning small models for chest X-rays. However, it remains unclear whether these approaches are robust when applied to reports from other modalities and anatomies. Which model and prompt configurations are best suited to serve as LLM judges for radiology evaluation? We conduct a thorough correlation analysis between expert and LLM-based ratings. We compare three existing LLM-as-a-judge metrics (RadFact, GREEN, and FineRadScore) alongside VERT, our proposed LLM-based metric, using open- and closed-source models (reasoning and non-reasoning) of different sizes across two expert-annotated datasets, RadEval and RaTE-Eval, spanning multiple modalities and anatomies. We further evaluate few-shot approaches, ensembling, and parameter-efficient fine-tuning using RaTE-Eval. To better understand metric behavior, we perform a systematic error detection and categorization study to assess the alignment of these metrics with expert judgments and to identify areas of lower and higher agreement. Our results show that VERT improves correlation with radiologist judgments by up to 11.7% relative to GREEN. Furthermore, fine-tuning Qwen3 30B yields gains of up to 25% using only 1,300 training samples. The fine-tuned model also reduces inference time by a factor of up to 37.2. These findings highlight the effectiveness of LLM-based judges and demonstrate that reliable evaluation can be achieved with lightweight adaptation.
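
To make the correlation analysis described above concrete, here is a minimal sketch of how agreement between expert ratings and an LLM judge's scores could be measured, and how a relative improvement over a baseline metric could be computed. The abstract does not state which correlation coefficient is used; Kendall's tau-b is an assumption here. The function names and the toy scores are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b rank correlation between two equal-length score lists.

    Assumption: the paper's coefficient is not given in the abstract;
    tau-b is used here only as one common choice for metric-vs-expert agreement.
    """
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        elif dx == 0:
            ties_x += 1  # tied only in xs
        elif dy == 0:
            ties_y += 1  # tied only in ys
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    if denom == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return (concordant - discordant) / denom


def relative_improvement(candidate, baseline):
    """Percentage gain of `candidate` over `baseline` (both correlation values)."""
    return (candidate - baseline) / baseline * 100.0


# Toy data (hypothetical): expert quality ratings for five reports and the
# scores that two different LLM judges assigned to the same reports.
expert = [1, 2, 3, 4, 5]
baseline_judge = [2, 1, 3, 5, 4]   # two swapped pairs relative to the expert
candidate_judge = [1, 3, 2, 4, 5]  # one swapped pair relative to the expert

tau_baseline = kendall_tau_b(expert, baseline_judge)
tau_candidate = kendall_tau_b(expert, candidate_judge)
gain = relative_improvement(tau_candidate, tau_baseline)
print(f"baseline tau={tau_baseline:.2f}, candidate tau={tau_candidate:.2f}, gain={gain:.1f}%")
```

Under this reading, a statement such as "improves correlation by up to 11.7% relative to GREEN" corresponds to `relative_improvement` applied to the two metrics' correlations with radiologist ratings on the same set of reports.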
