RadEval: A framework for radiology text evaluation
This work addresses the need for standardized evaluation in radiology report generation and facilitates reproducibility and benchmarking. Its contribution is incremental in that it builds on existing metrics and tools.
The authors introduce RadEval, a unified open-source framework for evaluating radiology texts. It consolidates diverse metrics, refines their implementations, and is released alongside an expert-annotated dataset with over 450 clinically significant error labels; the authors also show how the metrics correlate with radiologist judgment.
We introduce RadEval, a unified, open-source framework for evaluating radiology texts. RadEval consolidates a diverse range of metrics, from classic n-gram overlap (BLEU, ROUGE) and contextual measures (BERTScore) to clinical concept-based scores (F1CheXbert, F1RadGraph, RaTEScore, SRR-BERT, TemporalEntityF1) and advanced LLM-based evaluators (GREEN). We refine and standardize implementations, extend GREEN to support multiple imaging modalities with a lighter-weight model, and pretrain a domain-specific radiology encoder that demonstrates strong zero-shot retrieval performance. We also release a richly annotated expert dataset with over 450 clinically significant error labels and show how different metrics correlate with radiologist judgment. Finally, RadEval provides statistical testing tools and baseline model evaluations across multiple publicly available datasets, facilitating reproducibility and robust benchmarking in radiology report generation.
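To make the idea of statistical testing between systems concrete, the following is a minimal sketch of a paired randomization test on per-report metric scores. It is not RadEval's API: the function name `paired_randomization_test`, its parameters, and the example scores are all hypothetical, and the abstract does not say which tests the framework actually implements. The sketch assumes each system has one score per report (for example, a per-report metric value) and asks whether the mean difference could plausibly arise by chance.

```python
import random


def paired_randomization_test(scores_a, scores_b, n_trials=10_000, seed=0):
    """Two-sided paired randomization test on the mean score difference.

    Hypothetical helper for illustration only; not part of RadEval.
    scores_a and scores_b hold one metric score per report, in the same order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    if not scores_a:
        raise ValueError("score lists must not be empty")

    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)

    hits = 0
    for _ in range(n_trials):
        # Under the null hypothesis the system labels are exchangeable,
        # so swapping them for a report just flips the sign of its difference.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(permuted) >= observed:
            hits += 1

    # Add-one smoothing avoids reporting an impossible p-value of exactly 0.
    return (hits + 1) / (n_trials + 1)


if __name__ == "__main__":
    # Made-up per-report scores for two hypothetical report generators.
    system_a = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69, 0.64, 0.73]
    system_b = [0.55, 0.57, 0.63, 0.61, 0.59, 0.60, 0.58, 0.65]
    print(f"p-value: {paired_randomization_test(system_a, system_b):.4f}")
```

A paired test is used here because both systems are scored on the same reports, so per-report difficulty cancels out in the differences. Fixing the seed keeps the estimate reproducible across runs.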