CV · Feb 24, 2025

SwimVG: Step-wise Multimodal Fusion and Adaption for Visual Grounding

arXiv:2502.16786v2 · 7 citations · h-index: 6 · Has Code · IEEE Transactions on Multimedia
Originality: Incremental advance
AI Analysis

This work addresses computational inefficiency in visual grounding for AI applications, representing an incremental improvement over existing methods.

The paper tackles inefficient cross-modal alignment in visual grounding by introducing SwimVG, a step-wise multimodal fusion and adaptation framework. The authors report strong results and considerable efficiency benefits on four widely used benchmarks.

Visual grounding aims to ground an image region through natural language, which heavily relies on cross-modal alignment. Most existing methods transfer visual/linguistic knowledge separately by fully fine-tuning uni-modal pre-trained models, followed by a simple stack of visual-language transformers for multimodal fusion. However, these approaches not only limit adequate interaction between visual and linguistic contexts, but also incur significant computational costs. Therefore, to address these issues, we explore a step-wise multimodal fusion and adaption framework, namely SwimVG. Specifically, SwimVG proposes step-wise multimodal prompts (Swip) and cross-modal interactive adapters (CIA) for visual grounding, replacing the cumbersome transformer stacks for multimodal fusion. Swip can improve the alignment between the vision and language representations step by step, in a token-level fusion manner. In addition, weight-level CIA further promotes multimodal fusion by cross-modal interaction. Swip and CIA are both parameter-efficient paradigms, and they fuse the cross-modal features from shallow to deep layers gradually. Experimental results on four widely-used benchmarks demonstrate that SwimVG achieves remarkable abilities and considerable benefits in terms of efficiency. Our code is available at https://github.com/liuting20/SwimVG.
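To make the idea in the abstract concrete, here is a minimal NumPy sketch of layer-by-layer fusion into a frozen visual encoder. It is not the authors' implementation. The abstract does not specify how Swip or CIA are built, so every detail is an assumption made for illustration: the function names, the toy sizes, the mean-pooled text prompt, the low-rank cross-attention adapter, and the zero-initialised up-projection.

```python
import numpy as np

rng = np.random.default_rng(0)
D_VIS, D_TXT, RANK, N_LAYERS = 32, 24, 4, 3  # toy sizes, all hypothetical


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def frozen_layer(x, w):
    # Stand-in for one frozen pre-trained encoder layer: single-head
    # self-attention, so a prepended prompt token mixes with visual tokens.
    attn = softmax(x @ x.T / np.sqrt(x.shape[-1]))
    return x + attn @ x @ w


def make_prompt(txt, w_prompt):
    # Token-level fusion (assumed form): pool the text tokens and project
    # them to the visual width, giving one prompt token of shape (1, D_VIS).
    return txt.mean(axis=0, keepdims=True) @ w_prompt


def cross_modal_adapter(vis, txt, p):
    # Bottleneck adapter (assumed form): visual tokens attend to text tokens
    # in a low-rank space; the result is added back as a residual.
    q = vis @ p["down_v"]                      # (n_vis, RANK)
    k = txt @ p["down_t"]                      # (n_txt, RANK)
    attn = softmax(q @ k.T / np.sqrt(RANK))    # (n_vis, n_txt)
    return vis + (attn @ k) @ p["up"]          # (n_vis, D_VIS)


def init_layer():
    frozen = rng.normal(0, 0.1, (D_VIS, D_VIS))  # never updated
    trainable = {
        "prompt": rng.normal(0, 0.1, (D_TXT, D_VIS)),
        "down_v": rng.normal(0, 0.1, (D_VIS, RANK)),
        "down_t": rng.normal(0, 0.1, (D_TXT, RANK)),
        "up": np.zeros((RANK, D_VIS)),  # zero init: adapter starts as identity
    }
    return frozen, trainable


def encode(vis, txt, layers):
    # Inject text into the visual stream at every layer, shallow to deep.
    for frozen, p in layers:
        prompt = make_prompt(txt, p["prompt"])
        x = frozen_layer(np.concatenate([prompt, vis], axis=0), frozen)
        vis = cross_modal_adapter(x[1:], txt, p)  # drop the prompt token
    return vis


layers = [init_layer() for _ in range(N_LAYERS)]
vis = rng.normal(size=(16, D_VIS))  # 16 image patch tokens
txt = rng.normal(size=(5, D_TXT))   # 5 word tokens
out = encode(vis, txt, layers)
print(out.shape)
```

In this sketch only the small `trainable` dictionaries would receive gradients, while the `frozen` weights stay fixed. That is the general sense in which prompt and adapter methods are parameter-efficient. Fusion happens inside each encoder layer, with no separate fusion transformer stacked on top.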

Code Implementations: 1 repo
