CL · CV · May 18

How Good LLMs Are at Answering Bangla Medical Visual Questions? Dataset and Benchmarking

Rafid Ahmed, Intesar Tahmid, Mir Sazzat Hossain, Tasnimul Hossain Tomal, Md Fahim, Md Farhad Alam Bhuiyan

arXiv:2605.18111 · 85.6 · Has Code

Predicted impact: top 50% in CL · last 90 days
Originality: Incremental advance

AI Analysis

This work addresses the lack of MedVQA resources for Bangla, a low-resource language, highlighting severe limitations of current models in fine-grained medical reasoning for non-English languages.

The authors introduce BanglaMedVQA, the first MedVQA dataset for Bangla, and benchmark several LLMs and LVLMs. Even top models such as Gemini and GPT-4.1 mini perform poorly on specialized diagnostic questions. Open-source models such as Gemma-3 occasionally outperform them in general categories but still struggle with clinically complex questions.

Recent advancements in Large Language Models (LLMs) and Large Vision Language Models (LVLMs) have enabled general-purpose systems to demonstrate promising capabilities in complex reasoning tasks, including those in the medical domain. Medical Visual Question Answering (MedVQA) has particularly benefited from these developments. However, despite Bangla being one of the most widely spoken languages globally, there exists no established MedVQA benchmark for it. To address this gap, we introduce BanglaMedVQA, a dataset comprising clinically validated image-question-answer pairs, along with a comprehensive evaluation of current foundation models on this resource. Consistent with prior findings that report low performance of current models on English MedVQA benchmarks, our analysis reveals that Bangla performance is substantially lower, reflecting the challenges inherent to low-resource languages. Even top-performing models such as Gemini and GPT-4.1 mini fail to accurately answer specialized diagnostic questions, indicating severe limitations in fine-grained medical reasoning. Although certain open-source models, such as Gemma-3, occasionally outperform these models in general categories, they too struggle with clinically complex questions, underscoring the urgent need for top-notch evaluation methods.
