CV · AI · Jun 30, 2023

Multimodal Prompt Retrieval for Generative Visual Question Answering

arXiv:2306.17675v1 · 3 citations · h-index: 18
Originality: Incremental advance
AI Analysis

The paper addresses low-resource domain adaptation in VQA, for example in medicine. The advance is incremental because it builds on existing generative models.

The paper tackles overfitting and poor generalization in visual question answering (VQA) by proposing a generative model enhanced by multimodal prompt retrieval (MPR). In few-shot domain adaptation on medical VQA tasks, MPR outperforms its non-retrieval counterpart by up to 30 accuracy points.

Recent years have witnessed impressive results of pre-trained vision-language models on knowledge-intensive tasks such as visual question answering (VQA). Despite the recent advances in VQA, existing methods mainly adopt a discriminative formulation that predicts answers within a pre-defined label set, leading to easy overfitting on low-resource domains with limited labeled data (e.g., medicine) and poor generalization under domain shift to another dataset. To tackle this limitation, we propose a novel generative model enhanced by multimodal prompt retrieval (MPR) that integrates retrieved prompts and multimodal features to generate answers in free text. Our generative model enables rapid zero-shot dataset adaptation to unseen data distributions and open-set answer labels across datasets. Our experiments on medical VQA tasks show that MPR outperforms its non-retrieval counterpart by up to 30% accuracy points in a few-shot domain adaptation setting.
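
The abstract does not say how prompts are retrieved or fused, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes that each labeled example carries precomputed image and question embeddings, that retrieval is cosine similarity over their concatenation, and that the retrieved question-answer pairs are prepended as text to the new question before it reaches a generative decoder (not shown). All names (`Example`, `fuse`, `retrieve`, `build_prompt`) are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Example:
    """One labeled entry in the retrieval set (hypothetical structure)."""
    image_vec: list
    question_vec: list
    question: str
    answer: str


def l2_normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(image_vec, question_vec):
    """Assumed fusion: concatenate the normalized image and question vectors."""
    return l2_normalize(l2_normalize(image_vec) + l2_normalize(question_vec))


def cosine(a, b):
    # Both inputs are unit-length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_image, query_question, memory, k=2):
    """Return the k examples whose fused embedding is closest to the query."""
    q = fuse(query_image, query_question)
    scored = [(cosine(q, fuse(e.image_vec, e.question_vec)), e) for e in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in scored[:k]]


def build_prompt(question, retrieved):
    """Prepend retrieved QA pairs to the new question as a free-text prompt."""
    lines = [f"Q: {e.question} A: {e.answer}" for e in retrieved]
    lines.append(f"Q: {question} A:")
    return "\n".join(lines)
```

Because the answer is generated as free text after the final `A:` rather than chosen from a fixed label set, swapping in a retrieval set from a different dataset changes the prompt without requiring a new classification head. This matches the open-set adaptation the abstract describes.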

Code Implementations: 1 repo