CL · CV · Jun 2

Does Language Shift Break Medical Vision-Language Models? Indonesian Radiology Visual Question Answering Case Study

Pieter Christy Yan Yudhistira, Dzaki Rafif Malik, Novanto Yudistira

arXiv:2606.03693 · 19.3 · h-index: 7 · Has Code

Predicted impact: top 62% in CL · last 90 days · Originality: Synthesis-oriented

AI Analysis

For researchers and practitioners deploying medical VLMs in non-English clinical settings, this work highlights the need for multilingual evaluation to ensure model reliability.

The paper introduces IndoRad-VQA, an Indonesian adaptation of the VQA-RAD benchmark, and finds that medical VLMs perform 8 to 25 percent worse when answering radiology questions in Indonesian than in English, depending on the evaluation metric, revealing a language robustness gap.

Medical Vision-Language Models (VLMs) are typically evaluated on English radiology visual question answering benchmarks, leaving their robustness under non-English clinical language largely unexplored. We introduce IndoRad-VQA, an Indonesian adaptation of VQA-RAD, to assess whether medical VLMs retain radiology reasoning ability when questions are asked in Bahasa Indonesia. Radiology question-answer pairs are translated into Indonesian with self-evaluation-based quality control to preserve clinical meaning, terminology consistency, and answer equivalence. We evaluate general-purpose, Southeast Asian multilingual, and medical-specific VLMs under English and Indonesian prompting settings. Beyond accuracy, we quantify the language robustness gap between English and Indonesian inputs. We also conduct an error analysis to identify failure modes of question answering, such as yes/no flips, laterality errors, and output-language mismatches. Our findings show that strong performance on English medical VQA benchmarks does not necessarily translate to robust behavior in Indonesian clinical contexts. We observe a performance gap of 8 to 25 percent between the English and Indonesian settings, depending on the evaluation metric. These results highlight the need for more inclusive multilingual evaluation of medical multimodal foundation models. The dataset is available at https://huggingface.co/datasets/Lab-IS/IndoRad-VQA.
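
The abstract does not spell out how the language robustness gap is computed. As a minimal sketch, assuming the gap is simply English accuracy minus Indonesian accuracy under exact-match scoring, and that a yes/no flip means the same question receives opposite polarity answers in the two languages, the two quantities could be computed as below. The `Record` fields, the `normalize` rule, and the toy records are hypothetical and are not taken from the paper or the dataset.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical per-question record; field names are illustrative only.
    gold_en: str  # reference answer in English
    pred_en: str  # model answer under English prompting
    gold_id: str  # reference answer in Indonesian
    pred_id: str  # model answer under Indonesian prompting


def normalize(answer: str) -> str:
    # Assumed normalization: trim whitespace, lowercase, drop a trailing period.
    return answer.strip().lower().rstrip(".")


def accuracy(pairs) -> float:
    # Exact-match accuracy over (gold, prediction) pairs.
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(normalize(g) == normalize(p) for g, p in pairs) / len(pairs)


def robustness_gap(records) -> float:
    # Assumed definition: English accuracy minus Indonesian accuracy.
    acc_en = accuracy((r.gold_en, r.pred_en) for r in records)
    acc_id = accuracy((r.gold_id, r.pred_id) for r in records)
    return acc_en - acc_id


YES = {"yes", "ya"}    # "ya" is Indonesian for "yes"
NO = {"no", "tidak"}   # "tidak" is Indonesian for "no"


def polarity(answer: str):
    # Map an answer to True (yes), False (no), or None (not a yes/no answer).
    a = normalize(answer)
    if a in YES:
        return True
    if a in NO:
        return False
    return None


def yes_no_flips(records) -> int:
    # Count questions whose predicted polarity differs between the two languages.
    flips = 0
    for r in records:
        p_en, p_id = polarity(r.pred_en), polarity(r.pred_id)
        if p_en is not None and p_id is not None and p_en != p_id:
            flips += 1
    return flips


# Toy data for illustration only; these are not results from the paper.
records = [
    Record("yes", "Yes", "ya", "tidak"),      # correct in English, flipped in Indonesian
    Record("no", "no", "tidak", "tidak"),     # correct in both languages
    Record("left", "left", "kiri", "kanan"),  # laterality error in Indonesian
    Record("yes", "no", "ya", "ya"),          # wrong in English, correct in Indonesian
]

gap = robustness_gap(records)
flips = yes_no_flips(records)
```

On the toy records, English accuracy is 3/4 and Indonesian accuracy is 2/4, so the assumed gap is 0.25, and two questions count as yes/no flips. A relative gap (the difference divided by English accuracy) would be an equally plausible definition; the abstract notes only that the reported gap depends on the evaluation metric.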
