CL AI CV LG Nov 18, 2020

Inspecting state of the art performance and NLP metrics in image-based medical report generation

Pablo Pino, Denis Parra, Pablo Messina, Cecilia Besa, Sergio Uribe

arXiv:2011.09257v30.511 citations Has Code

Originality: Incremental advance

AI Analysis

This paper highlights a significant problem for researchers and developers working on medical report generation: the standard NLP metrics used to evaluate generated reports may not reflect clinical accuracy.

This paper investigates the reported progress in image-based medical report generation by comparing state-of-the-art models against simple baselines. It finds that naive approaches achieve near state-of-the-art performance on traditional NLP metrics.

Several deep learning architectures have been proposed over the last years to deal with the problem of generating a written report given an imaging exam as input. Most works evaluate the generated reports using standard Natural Language Processing (NLP) metrics (e.g. BLEU, ROUGE), reporting significant progress. In this article, we contrast this progress by comparing state of the art (SOTA) models against weak baselines. We show that simple and even naive approaches yield near SOTA performance on most traditional NLP metrics. We conclude that evaluation methods in this task should be further studied towards correctly measuring clinical accuracy, ideally involving physicians to contribute to this end.
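The effect described in the abstract can be illustrated with a toy sketch. This is not the authors' code or data: the three reference reports, the `constant_baseline` function, and the `unigram_precision` helper are all hypothetical, and the metric is only the clipped unigram precision at the core of BLEU-1 (no brevity penalty, no higher-order n-grams). Under those assumptions, a baseline that ignores the image and always emits the same "normal" report still overlaps heavily with a reference that describes an abnormal exam.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: the core term of BLEU-1, without brevity penalty."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Each candidate token is credited at most as often as it occurs in the reference.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return overlap / len(cand_tokens)


# Hypothetical reference reports, invented for illustration only.
references = [
    "the heart is normal in size . the lungs are clear . no pleural effusion .",
    "the heart is normal in size . the lungs are clear . no pneumothorax .",
    "the heart is enlarged . the lungs are clear . small pleural effusion .",
]


def constant_baseline(_image=None) -> str:
    """Naive baseline (assumption): ignore the image, always return one fixed report."""
    return "the heart is normal in size . the lungs are clear . no pleural effusion ."


scores = [unigram_precision(constant_baseline(), ref) for ref in references]
for ref, score in zip(references, scores):
    print(f"{score:.3f}  {ref}")
```

In this toy setup the third reference describes an enlarged heart and a pleural effusion, which the constant report contradicts, yet three quarters of its tokens are still matched. This is the kind of mismatch between n-gram overlap and clinical correctness that the paper argues calls for better evaluation.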
