CV · AI · LG · May 26, 2025

MM-Prompt: Cross-Modal Prompt Tuning for Continual Visual Question Answering

arXiv:2505.19455v2 · 1 citation · h-index: 4
Originality: Incremental advance
AI Analysis

This paper addresses modality imbalance in continual learning for visual question answering, offering an incremental improvement over existing prompt-tuning approaches.

The paper tackles modality imbalance in continual visual question answering by proposing MM-Prompt, a framework that combines cross-modal prompt query with cross-modal prompt recovery, and it reports higher accuracy and better knowledge retention than prior methods.

Continual Visual Question Answering (CVQA) based on pre-trained models (PTMs) has achieved promising progress by leveraging prompt tuning to enable continual multi-modal learning. However, most existing methods adopt cross-modal prompt isolation, constructing visual and textual prompts separately, which exacerbates modality imbalance and leads to degraded performance over time. To tackle this issue, we propose MM-Prompt, a novel framework incorporating cross-modal prompt query and cross-modal prompt recovery. The former enables balanced prompt selection by incorporating cross-modal signals during query formation, while the latter promotes joint prompt reconstruction through iterative cross-modal interactions, guided by an alignment loss to prevent representational drift. Extensive experiments show that MM-Prompt surpasses prior approaches in accuracy and knowledge retention, while maintaining balanced modality engagement throughout continual learning.
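
The contrast between isolated and cross-modal prompt selection described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`mix`, `select_prompts`), the linear blend controlled by `alpha`, the cosine-similarity ranking, and the three-dimensional toy features are all assumptions made here for illustration, since the abstract does not specify how the query is formed.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mix(primary, other, alpha):
    """Hypothetical query formation: blend one modality's feature with the other's.

    alpha = 0 reproduces an isolated (single-modality) query.
    """
    return [(1 - alpha) * p + alpha * o for p, o in zip(primary, other)]


def select_prompts(query, prompt_keys, top_k):
    """Return indices of the top_k prompt keys most similar to the query."""
    scores = [cosine(query, key) for key in prompt_keys]
    ranked = sorted(range(len(prompt_keys)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Toy features and a toy prompt pool (illustrative values only).
visual_feat = [1.0, 0.0, 0.0]
text_feat = [0.0, 1.0, 0.0]
prompt_keys = [
    [1.0, 0.0, 0.0],  # key aligned with the visual feature only
    [0.0, 1.0, 0.0],  # key aligned with the text feature only
    [1.0, 1.0, 0.0],  # key aligned with both modalities
]

# Isolated query: the visual feature alone picks the vision-only key.
isolated = select_prompts(visual_feat, prompt_keys, top_k=1)

# Cross-modal query: blending in the text feature picks the shared key.
cross_modal = select_prompts(mix(visual_feat, text_feat, alpha=0.5), prompt_keys, top_k=1)
```

In this toy setup the isolated query selects the vision-only key, while the blended query selects the key that serves both modalities, which is the kind of balanced selection the abstract attributes to cross-modal prompt query. The recovery step and the alignment loss are not sketched here because the abstract gives no detail on their form.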

Code Implementations: 1 repo
