CL | Aug 26, 2025

Knowing or Guessing? Robust Medical Visual Question Answering via Joint Consistency and Contrastive Learning

arXiv:2508.18687v1 | 2 citations | h-index: 11 | MICCAI
Originality: Incremental advance
AI Analysis

This work addresses reliability issues in high-stakes medical diagnosis by making Med-VQA models more robust to question variations. It is incremental, however, because it builds on existing Med-VLM frameworks.

The paper tackles inconsistent answers in Medical Visual Question Answering (Med-VQA) when questions are rephrased. It shows that state-of-the-art models such as LLaVA-Med suffer a 40% drop in Recall on a new robustness dataset (RoMed), and it proposes a method (CCL) that improves answer consistency by 50% on RoMed while achieving SOTA results on standard benchmarks.

In high-stakes medical applications, consistent answering across diverse question phrasings is essential for reliable diagnosis. However, we reveal that current Medical Vision-Language Models (Med-VLMs) exhibit concerning fragility in Medical Visual Question Answering, as their answers fluctuate significantly when faced with semantically equivalent rephrasings of medical questions. We attribute this to two limitations: (1) insufficient alignment of medical concepts, leading to divergent reasoning patterns, and (2) hidden biases in training data that prioritize syntactic shortcuts over semantic understanding. To address these challenges, we construct RoMed, a dataset built upon original VQA datasets containing 144k questions with variations spanning word-level, sentence-level, and semantic-level perturbations. When evaluating state-of-the-art (SOTA) models like LLaVA-Med on RoMed, we observe alarming performance drops (e.g., a 40% decline in Recall) compared to original VQA benchmarks, exposing critical robustness gaps. To bridge this gap, we propose Consistency and Contrastive Learning (CCL), which integrates two key components: (1) knowledge-anchored consistency learning, aligning Med-VLMs with medical knowledge rather than shallow feature patterns, and (2) bias-aware contrastive learning, mitigating data-specific priors through discriminative representation refinement. CCL achieves SOTA performance on three popular VQA benchmarks and notably improves answer consistency by 50% on the challenging RoMed test set, demonstrating significantly enhanced robustness. Code will be released.
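To make the idea of "answer consistency" across rephrasings concrete, here is a minimal Python sketch. It is not the paper's metric or code: the abstract does not define how consistency is computed, so the `consistency_rate` function, the `normalize` helper, the group identifiers, and the all-answers-must-match criterion are all assumptions chosen for illustration.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Lowercase, trim, drop a trailing period, and collapse whitespace.

    Hypothetical normalization; a real evaluation would likely need
    medical synonym handling as well.
    """
    return " ".join(answer.lower().strip().rstrip(".").split())


def consistency_rate(predictions):
    """Fraction of question groups whose rephrasings all get the same answer.

    `predictions` is an iterable of (group_id, answer) pairs, where every
    rephrasing of one original question shares a group_id. This definition
    is an assumption for illustration, not the paper's stated metric.
    """
    groups = defaultdict(set)
    for group_id, answer in predictions:
        groups[group_id].add(normalize(answer))
    if not groups:
        return 0.0
    consistent = sum(1 for answers in groups.values() if len(answers) == 1)
    return consistent / len(groups)


# Toy data: "q1" is answered consistently, "q2" is not.
preds = [
    ("q1", "Yes"), ("q1", "yes."), ("q1", "Yes"),
    ("q2", "Left lung"), ("q2", "Right lung"),
]
print(consistency_rate(preds))  # 0.5
```

Under this definition, a model that answers correctly on the original phrasing but flips on a paraphrase scores zero for that group, which is the kind of fragility the abstract describes.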

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
