Radiology-Aware Model-Based Evaluation Metric for Report Generation
This work addresses the need for better evaluation metrics in radiology report generation, a crucial component of medical AI applications. The contribution is incremental, however, because it adapts an existing architecture to a specific domain.
The authors address the problem of evaluating machine-generated radiology reports by proposing a new automated metric based on the COMET architecture, adapted for radiology through medically oriented checkpoints, including one trained on RadGraph. The results show moderate to high correlation with established metrics such as BERTScore and BLEU, and one checkpoint achieves high correlation with the judgment of board-certified radiologists on a set of 200 reports.
We propose a new automated evaluation metric for machine-generated radiology reports, using the successful COMET architecture adapted for the radiology domain. We train and publish four medically oriented model checkpoints, including one trained on RadGraph, a radiology knowledge graph. Our results show that our metric has moderate to high correlation with established metrics such as BERTScore, BLEU, and CheXbert scores. Furthermore, we demonstrate that one of our checkpoints exhibits a high correlation with human judgment, as assessed on a set of 200 reports with publicly available annotations from six board-certified radiologists. We also performed our own analysis, collecting annotations from two radiologists on a collection of 100 reports. The results indicate the potential effectiveness of our method as a radiology-specific evaluation metric. The code, data, and model checkpoints to reproduce our findings will be publicly available.
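To make the notion of correlation with human judgment concrete, the following is a minimal sketch of how such a correlation could be computed for a set of reports. The abstract does not state which correlation coefficient is used, so Kendall's tau-b is an assumption here; the function name `kendall_tau_b` and all scores below are hypothetical and serve only as illustration, not as the released implementation or as reported results.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b rank correlation between two equal-length score lists.

    Assumption: tau-b is used for illustration only; the source does not
    specify the correlation coefficient.
    """
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")

    concordant = discordant = ties_x = ties_y = 0
    # Compare every pair of reports once.
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes to neither term
        if dx == 0:
            ties_x += 1  # tied only in the first list
        elif dy == 0:
            ties_y += 1  # tied only in the second list
        elif dx * dy > 0:
            concordant += 1  # both lists order the pair the same way
        else:
            discordant += 1  # the lists order the pair in opposite ways

    denom = sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical per-report scores (invented for illustration, not real data).
metric_scores = [0.82, 0.41, 0.67, 0.15, 0.90, 0.55]
# Hypothetical radiologist ratings for the same reports (higher = better).
human_ratings = [4, 2, 4, 1, 5, 2]

tau = kendall_tau_b(metric_scores, human_ratings)
print(f"Kendall tau-b between metric and human ratings: {tau:.3f}")
```

A rank-based coefficient is one reasonable choice in this setting because it compares only the ordering of reports, so the metric and the human ratings do not need to share a scale.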