RoiMAM: Region-of-Interest Medical Attention Model for Efficient Vision-Language Understanding
For medical VQA practitioners, RoiMAM offers a more efficient and more accurate model for clinical diagnosis, though the gains over existing methods are incremental.
RoiMAM is an efficient vision-language model for medical visual question answering that focuses on lesion-relevant regions. Compared with MedVInT-TD, it improves accuracy by approximately 2% on SLAKE and 4.6% on PMC-VQA at less than 20% of the model size.
Vision-Language Models (VLMs) facilitate medical visual question answering (MedVQA) by jointly interpreting images and text. However, existing models typically depend on large architectures and closed-set answers, which limits their efficiency and potential clinical applicability. To overcome these shortcomings, we introduce RoiMAM, an efficient VLM. It integrates a training-free ROI Generation Module with Semantic Selective Suppression to focus on lesion-relevant regions, alongside a Text Prompt Enhancer module that provides modality-specific context without introducing training parameters. Compared to the widely used MedVInT-TD model, our design achieves efficient and accurate diagnosis at less than 20% of the model size, while improving accuracy by approximately 2% on SLAKE and 4.6% on PMC-VQA.
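The abstract does not spell out how either module works internally, so the following is only a minimal sketch of the two general ideas it names: selecting a region of interest without training, and adding modality-specific context to the question without trainable parameters. Everything in it is an assumption made for illustration: the names `enhance_prompt`, `roi_bounding_box`, `crop`, and `MODALITY_CONTEXT`, the threshold rule, and the precomputed relevance map. It does not implement Semantic Selective Suppression and is not the authors' method.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical modality descriptions; not taken from the paper.
MODALITY_CONTEXT: Dict[str, str] = {
    "ct": "This is a computed tomography (CT) scan.",
    "mri": "This is a magnetic resonance imaging (MRI) scan.",
    "xray": "This is an X-ray image.",
}


def enhance_prompt(question: str, modality: str) -> str:
    """Prepend fixed modality context to a question (no learned parameters)."""
    context = MODALITY_CONTEXT.get(modality.lower(), "This is a medical image.")
    return f"{context} Question: {question}"


def roi_bounding_box(
    relevance: List[List[float]], threshold: float
) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) enclosing all scores >= threshold.

    `relevance` is an assumed per-pixel (or per-patch) relevance map; how such
    a map would be produced is outside the scope of this sketch.
    """
    rows = [r for r, row in enumerate(relevance) if any(v >= threshold for v in row)]
    cols = [c for row in relevance for c, v in enumerate(row) if v >= threshold]
    if not rows:
        return None  # nothing exceeded the threshold
    return (min(rows), min(cols), max(rows), max(cols))


def crop(image: List[List[int]], box: Tuple[int, int, int, int]) -> List[List[int]]:
    """Crop a 2D image (list of rows) to an inclusive bounding box."""
    top, left, bottom, right = box
    return [row[left : right + 1] for row in image[top : bottom + 1]]


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    relevance = [[0.1, 0.2, 0.1], [0.1, 0.9, 0.8], [0.0, 0.7, 0.1]]
    box = roi_bounding_box(relevance, threshold=0.5)
    if box is not None:
        print(crop(image, box))
    print(enhance_prompt("Is there a lesion?", "CT"))
```

Both functions are pure and parameter-free, which is the property the abstract emphasizes; in a real pipeline the cropped region and the enhanced prompt would then be passed to the vision-language model.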