CV · MA · Jul 20, 2025

Med-GRIM: Enhanced Zero-Shot Medical VQA using prompt-embedded Multimodal Graph RAG

Rakesh Raj Madavan, Akshat Kaimal, Hashim Faisal, Chandrakala S

arXiv:2508.06496v1 · 3.61 citations · h-index: 1 · Has Code · 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

Originality: Incremental advance

AI Analysis

This work addresses the need for accurate and efficient medical VQA systems, which are crucial for healthcare applications. It appears incremental, however, because it builds on existing multimodal and RAG techniques.

The paper tackles the lack of fine-grained precision in medical visual question answering (VQA). It introduces Med-GRIM, a model that uses prompt-embedded multimodal graph retrieval-augmented generation (RAG) to integrate domain-specific knowledge without heavy fine-tuning. According to the authors, this yields performance comparable to large language models at a fraction of the computational cost.

An ensemble of trained multimodal encoders and vision-language models (VLMs) has become a standard approach for visual question answering (VQA) tasks. However, such models often fail to produce responses with the detailed precision necessary for complex, domain-specific applications such as medical VQA. Our representation model, BIND: BLIVA Integrated with Dense Encoding, extends prior multimodal work by refining the joint embedding space through dense, query-token-based encodings inspired by contrastive pretraining techniques. This refined encoder powers Med-GRIM, a model designed for medical VQA tasks that leverages graph-based retrieval and prompt engineering to integrate domain-specific knowledge. Rather than relying on compute-heavy fine-tuning of vision and language models on specific datasets, Med-GRIM applies a low-compute, modular workflow with small language models (SLMs) for efficiency. Med-GRIM employs prompt-based retrieval to dynamically inject relevant knowledge, ensuring both accuracy and robustness in its responses. By assigning distinct roles to each agent within the VQA system, Med-GRIM achieves large language model performance at a fraction of the computational cost. Additionally, to support scalable research in zero-shot multimodal medical applications, we introduce DermaGraph, a novel Graph-RAG dataset comprising diverse dermatological conditions. This dataset facilitates both multimodal and unimodal querying. The code and dataset are available at: https://github.com/Rakesh-123-cryp/Med-GRIM.git
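The graph-based retrieval and prompt injection described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the linked repository for that). The toy graph, the three-dimensional vectors, and the helpers `cosine`, `retrieve`, and `build_prompt` are all hypothetical, and the sketch assumes that a query embedding has already been produced by some encoder.

```python
import math

# Hypothetical toy knowledge graph: each node has an embedding, a text
# snippet, and edges to related nodes. Values are made up for illustration.
GRAPH = {
    "eczema": {
        "vec": [0.9, 0.1, 0.0],
        "text": "Eczema: itchy, inflamed patches of skin.",
        "edges": ["eczema_treatment"],
    },
    "eczema_treatment": {
        "vec": [0.7, 0.3, 0.1],
        "text": "Eczema care: moisturizers and topical corticosteroids.",
        "edges": [],
    },
    "psoriasis": {
        "vec": [0.1, 0.9, 0.0],
        "text": "Psoriasis: thick, scaly plaques.",
        "edges": ["psoriasis_treatment"],
    },
    "psoriasis_treatment": {
        "vec": [0.2, 0.8, 0.1],
        "text": "Psoriasis care: topical agents and phototherapy.",
        "edges": [],
    },
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, graph, top_k=1):
    """Pick the top_k most similar nodes, then add their direct neighbors."""
    ranked = sorted(
        graph, key=lambda n: cosine(query_vec, graph[n]["vec"]), reverse=True
    )
    selected = []
    for seed in ranked[:top_k]:
        for node in [seed] + graph[seed]["edges"]:
            if node not in selected:
                selected.append(node)
    return selected


def build_prompt(question, nodes, graph):
    """Embed the retrieved snippets into a prompt for a language model."""
    context = "\n".join(f"- {graph[n]['text']}" for n in nodes)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Assumed query embedding; a real system would obtain it from an encoder.
nodes = retrieve([1.0, 0.0, 0.0], GRAPH)
prompt = build_prompt("What condition is shown?", nodes, GRAPH)
```

The resulting `prompt` string would then be passed to a small language model. Expanding from the best-matching node to its neighbors is one simple way to use graph structure during retrieval; the paper's actual traversal strategy may differ.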
