CV · AI · Mar 4, 2025

BioD2C: A Dual-level Semantic Consistency Constraint Framework for Biomedical VQA

arXiv:2503.02476v1 · 1 citation · h-index: 1 · MICCAI
AI Analysis

This work addresses multimodal alignment in biomedical VQA, which has applications in assistive medical diagnosis. The contribution appears incremental: a new method aimed at a known bottleneck.

The authors address suboptimal multimodal semantic alignment in biomedical visual question answering (VQA) by proposing BioD2C, a dual-level semantic consistency constraint framework that they report achieves state-of-the-art performance across multiple downstream datasets.

Biomedical visual question answering (VQA) has been widely studied and has demonstrated significant application value and potential in fields such as assistive medical diagnosis. Despite this success, current biomedical VQA models perform multimodal information interaction only at the model level within large language models (LLMs), leading to suboptimal multimodal semantic alignment on complex tasks. To address this issue, we propose BioD2C, a novel Dual-level Semantic Consistency Constraint Framework for Biomedical VQA, which achieves dual-level semantic interaction alignment at both the model and feature levels, enabling the model to adaptively learn visual features based on the question. Specifically, we first integrate textual features into visual features via an image-text fusion mechanism as feature-level semantic interaction, obtaining visual features conditioned on the given text; we then introduce a text-queue-based cross-modal soft semantic loss function to further align the image semantics with the question semantics. In addition, we establish a new dataset, BioVGQ, to address inherent biases in prior datasets by filtering manually altered images and aligning question-answer pairs with multimodal context, and we train our model on this dataset. Extensive experimental results demonstrate that BioD2C achieves state-of-the-art (SOTA) performance across multiple downstream datasets, showcasing its robustness, generalizability, and potential to advance biomedical VQA research.
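The two feature-level steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the fusion is written as single-head cross-attention with a residual add, the text queue is taken to hold embeddings of earlier questions, and the soft loss is written as a cross-entropy between the question's and the image's similarity distributions over that queue. The names `fuse_text_into_visual`, `soft_semantic_loss`, and `mean_pool` are hypothetical and are not taken from the authors' code.

```python
import math
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def softmax(scores, temperature=1.0):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(tokens):
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]


def fuse_text_into_visual(visual_tokens, text_tokens):
    """Assumed fusion: each visual token attends over the text tokens and
    the attended text context is added back residually."""
    dim = len(text_tokens[0])
    fused = []
    for v in visual_tokens:
        weights = softmax([dot(v, t) / math.sqrt(dim) for t in text_tokens])
        context = [sum(w * t[i] for w, t in zip(weights, text_tokens))
                   for i in range(dim)]
        fused.append([vi + ci for vi, ci in zip(v, context)])
    return fused


def soft_semantic_loss(image_emb, question_emb, text_queue, temperature=0.1):
    """Assumed soft loss: cross-entropy between the question's and the
    image's similarity distributions over a queue of text embeddings."""
    image_emb = normalize(image_emb)
    question_emb = normalize(question_emb)
    keys = [normalize(k) for k in text_queue]
    target = softmax([dot(question_emb, k) for k in keys], temperature)
    predicted = softmax([dot(image_emb, k) for k in keys], temperature)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))


# Toy 2-dimensional example with made-up embeddings.
queue = deque(maxlen=4)  # fixed-size queue: oldest entries drop out
for past_question in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]):
    queue.append(past_question)

visual = [[1.0, 0.0], [0.5, 0.5]]
question = [[0.0, 1.0]]

fused = fuse_text_into_visual(visual, question)
loss = soft_semantic_loss(mean_pool(fused), mean_pool(question), queue)
print(fused, loss)
```

In this sketch the loss is smallest when the pooled image embedding induces the same similarity distribution over the queue as the question does, which is one way to read "soft" alignment as opposed to a hard one-to-one contrastive target.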

Code Implementations: 1 repo