Remodeling Semantic Relationships in Vision-Language Fine-Tuning
This work addresses a bottleneck in vision-language fine-tuning for constructing multimodal foundation models, with applications in tasks like visual question answering and image captioning.
Existing vision-language fine-tuning methods overlook semantic relationships within images, which leads to suboptimal performance. The proposed method improves multimodal alignment and fusion in three steps: it extracts multilevel semantic features, projects the vision features to group related semantics, and uses inheritable cross-attention to remove redundant visual relationships. It outperforms all existing methods on eight foundation models and two downstream tasks.
Vision-language fine-tuning has emerged as an efficient paradigm for constructing multimodal foundation models. While textual context often highlights semantic relationships within an image, existing fine-tuning methods typically overlook this information when aligning vision and language, which leads to suboptimal performance. To address this problem, we propose a method that improves multimodal alignment and fusion based on both semantics and relationships. Specifically, we first extract multilevel semantic features from different vision encoders to capture more visual cues about the relationships. Then, we learn to project the vision features so that related semantics, which are more likely to share relationships, are grouped together. Finally, we fuse the visual features with the textual features using inheritable cross-attention, where we globally remove redundant visual relationships by discarding vision-language feature pairs with low correlation. We evaluate the proposed method on eight foundation models and two downstream tasks, visual question answering and image captioning, and show that it outperforms all existing methods.