CV LG Apr 16, 2025

Bridging the Semantic Gaps: Improving Medical VQA Consistency with LLM-Augmented Question Sets

Yongpei Ma, Pengyu Wang, Adam Dunn, Usman Naseem, Jinman Kim

arXiv:2504.11777v1 3.6 h-index: 25

Originality Incremental advance

AI Analysis

The paper addresses inconsistency in MVQA systems for medical applications, but the advance is incremental: it applies augmentation techniques to existing datasets and models.

The paper tackles linguistic variability, which undermines consistency in Medical Visual Question Answering (MVQA) systems. It proposes a Semantically Equivalent Question Augmentation (SEQA) framework that uses LLMs to generate diverse rephrasings of questions. For fine-tuned models, this yields an average accuracy improvement of 19.35% and an average improvement of 11.61% on the proposed consistency metric.

Medical Visual Question Answering (MVQA) systems can interpret medical images in response to natural language queries. However, linguistic variability in question phrasing often undermines the consistency of these systems. To address this challenge, we propose a Semantically Equivalent Question Augmentation (SEQA) framework, which leverages large language models (LLMs) to generate diverse yet semantically equivalent rephrasings of questions. This approach enriches linguistic diversity while preserving semantic meaning. We further introduce an evaluation metric, Total Agreement Rate with Semantically Equivalent Input and Correct Answer (TAR-SC), which assesses a model's capability to generate consistent and correct responses to semantically equivalent linguistic variations. We also propose three diversity metrics - average number of QA items per image (ANQI), average number of questions per image with the same answer (ANQA), and average number of open-ended questions per image with the same semantics (ANQS). Using the SEQA framework, we augmented three public MVQA benchmark datasets: SLAKE, VQA-RAD, and PathVQA. All three datasets improved substantially on the diversity metrics after incorporating more semantically equivalent questions: ANQI increased by an average of 86.1, ANQA by 85.1, and ANQS by 46. Subsequent experiments evaluate three MVQA models (M2I2, MUMC, and BiomedGPT) under both zero-shot and fine-tuning settings on the enhanced datasets. Experimental results on the MVQA datasets show that fine-tuned models achieve an average accuracy improvement of 19.35%, while our proposed TAR-SC metric shows an average improvement of 11.61%, indicating a substantial enhancement in model consistency.
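
The abstract names TAR-SC and ANQI but does not give their formulas, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that TAR-SC is the fraction of semantically equivalent question groups for which every prediction matches the reference answer, and that ANQI is the total number of QA items divided by the number of distinct images. The record fields (`image_id`, `group_id`, `answer`, `prediction`) and the function names are hypothetical.

```python
from collections import defaultdict


def _norm(text):
    # Assumed normalization: case-insensitive exact match after trimming.
    return text.strip().lower()


def tar_sc(records):
    """Fraction of equivalence groups answered correctly for every rephrasing.

    Assumes all questions in a group share one reference answer, so
    "all correct" also implies "all predictions agree".
    """
    groups = defaultdict(list)
    for r in records:
        groups[(r["image_id"], r["group_id"])].append(r)
    if not groups:
        return 0.0
    fully_correct = sum(
        all(_norm(r["prediction"]) == _norm(r["answer"]) for r in items)
        for items in groups.values()
    )
    return fully_correct / len(groups)


def anqi(records):
    """Average number of QA items per image (assumed reading of ANQI)."""
    per_image = defaultdict(int)
    for r in records:
        per_image[r["image_id"]] += 1
    if not per_image:
        return 0.0
    return sum(per_image.values()) / len(per_image)


# Hypothetical toy data: two images, one equivalence group each.
sample = [
    {"image_id": "img1", "group_id": 0, "answer": "Yes", "prediction": "yes"},
    {"image_id": "img1", "group_id": 0, "answer": "Yes", "prediction": "Yes"},
    {"image_id": "img1", "group_id": 0, "answer": "Yes", "prediction": "yes "},
    {"image_id": "img2", "group_id": 0, "answer": "Lung", "prediction": "lung"},
    {"image_id": "img2", "group_id": 0, "answer": "Lung", "prediction": "liver"},
]
```

On this toy data the first group counts toward TAR-SC because all three rephrasings are answered correctly, while the second does not because one rephrasing flips the prediction. Under this reading, a model can have high per-question accuracy and still score low on TAR-SC if its errors are spread across many groups.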
