CV AI | Oct 31, 2023

A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis

Yingshu Li, Yunyi Liu, Zhanyu Wang, Xinyu Liang, Lei Wang, Lingqiao Liu, Leyang Cui, Zhaopeng Tu, Longyue Wang, Luping Zhou

arXiv:2310.20381v519.345 citations | h-index: 17

Originality: Synthesis-oriented

AI Analysis

This work assesses GPT-4V's potential for medical AI applications, highlighting its strengths and limitations, but it is incremental as it focuses on evaluation rather than new method development.

This paper evaluates GPT-4V's multimodal capabilities for medical image analysis, finding that it excels at generating high-quality radiology reports and answering questions but performs poorly in medical visual grounding, and noting a discrepancy between quantitative and human evaluations.

This work evaluates GPT-4V's multimodal capability for medical image analysis, focusing on three representative tasks: radiology report generation, medical visual question answering, and medical visual grounding. For each task, a set of prompts is designed to elicit the corresponding capability of GPT-4V and produce sufficiently good outputs. Three forms of evaluation (quantitative analysis, human evaluation, and case study) are employed to achieve an in-depth and extensive assessment. Our evaluation shows that GPT-4V excels in understanding medical images: it is able to generate high-quality radiology reports and effectively answer questions about medical images. However, its performance on medical visual grounding needs substantial improvement. In addition, we observe a discrepancy between the outcomes of quantitative analysis and those of human evaluation. This discrepancy suggests that conventional metrics are limited in assessing the performance of large language models like GPT-4V, and that new metrics for automatic quantitative analysis are needed.
