CV · MM · Apr 15, 2023

CoVLR: Coordinating Cross-Modal Consistency and Intra-Modal Structure for Vision-Language Retrieval

arXiv:2304.07567v2 · 1 citation · h-index: 21
Originality: Incremental advance
AI Analysis

This paper addresses a specific challenge in vision-language retrieval and offers researchers an incremental improvement by better balancing cross-modal and single-modal retrieval.

The paper tackles the problem that enforcing hard cross-modal consistency in vision-language retrieval can degrade single-modal retrieval by disrupting intra-modal relationships, and proposes CoVLR to coordinate cross-modal alignment with intra-modal structure preservation, improving single-modal retrieval accuracy while maintaining cross-modal performance.

Current vision-language retrieval aims to perform cross-modal instance search, in which the core idea is to learn consistent vision-language representations. Although the performance of cross-modal retrieval has greatly improved with the development of deep models, we unfortunately find that traditional hard consistency may destroy the original relationships among single-modal instances, leading to performance degradation in single-modal retrieval. To address this challenge, in this paper, we experimentally observe that the vision-language divergence may cause the existence of strong and weak modalities, and that hard cross-modal consistency cannot guarantee that the relationships among strong-modality instances are unaffected by the weak modality, so these relationships are perturbed even though consistent representations are learned. To this end, we propose a novel, directly Coordinated Vision-Language Retrieval method (dubbed CoVLR), which aims to study and alleviate the desynchrony problem between the cross-modal alignment and single-modal cluster-preserving tasks. CoVLR addresses this challenge by developing an effective meta-optimization-based strategy, in which the cross-modal consistency objective and the intra-modal relation-preserving objective act as the meta-train and meta-test tasks, respectively, so that both tasks are optimized in a coordinated way. Consequently, we can simultaneously ensure cross-modal consistency and intra-modal structure. Experiments on different datasets validate that CoVLR can improve single-modal retrieval accuracy whilst preserving cross-modal retrieval capacity compared with the baselines.
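
The meta-train / meta-test coordination described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes linear projections, a squared-distance consistency loss, a cosine-similarity structure loss, and an MLDG-style update in which the intra-modal loss is evaluated after one virtual gradient step on the cross-modal loss. All function names and the hyperparameters `alpha`, `beta`, and `lr` are hypothetical, and the outer gradient is taken by finite differences only to keep the example dependency-free.

```python
import numpy as np

def cosine_sim(z, eps=1e-8):
    # Pairwise cosine similarity between the rows of z.
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)
    return z @ z.T

def unpack(theta, d_v, d_t, k):
    # Split the flat parameter vector into vision and text projections.
    w_v = theta[: d_v * k].reshape(d_v, k)
    w_t = theta[d_v * k:].reshape(d_t, k)
    return w_v, w_t

def cross_modal_loss(theta, x_v, x_t, k):
    # Meta-train task (assumed form): paired embeddings should coincide.
    w_v, w_t = unpack(theta, x_v.shape[1], x_t.shape[1], k)
    diff = x_v @ w_v - x_t @ w_t
    return np.mean(np.sum(diff ** 2, axis=1))

def cross_modal_grad(theta, x_v, x_t, k):
    # Analytic gradient of cross_modal_loss with respect to theta.
    w_v, w_t = unpack(theta, x_v.shape[1], x_t.shape[1], k)
    diff = x_v @ w_v - x_t @ w_t
    n = x_v.shape[0]
    g_v = (2.0 / n) * (x_v.T @ diff)
    g_t = -(2.0 / n) * (x_t.T @ diff)
    return np.concatenate([g_v.ravel(), g_t.ravel()])

def intra_modal_loss(theta, x_v, x_t, k):
    # Meta-test task (assumed form): keep each modality's pairwise
    # similarity structure close to that of its input features.
    w_v, w_t = unpack(theta, x_v.shape[1], x_t.shape[1], k)
    loss_v = np.mean((cosine_sim(x_v) - cosine_sim(x_v @ w_v)) ** 2)
    loss_t = np.mean((cosine_sim(x_t) - cosine_sim(x_t @ w_t)) ** 2)
    return loss_v + loss_t

def meta_objective(theta, x_v, x_t, k, alpha, beta):
    # Virtual step on the meta-train loss, then score the meta-test
    # loss at the updated parameters.
    theta_inner = theta - alpha * cross_modal_grad(theta, x_v, x_t, k)
    return (cross_modal_loss(theta, x_v, x_t, k)
            + beta * intra_modal_loss(theta_inner, x_v, x_t, k))

def numerical_grad(f, theta, eps=1e-5):
    # Central finite differences; adequate for a few dozen parameters.
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

def train(x_v, x_t, k=3, alpha=0.05, beta=1.0, lr=0.05, steps=60, seed=0):
    rng = np.random.default_rng(seed)
    theta = 0.5 * rng.standard_normal((x_v.shape[1] + x_t.shape[1]) * k)
    f = lambda th: meta_objective(th, x_v, x_t, k, alpha, beta)
    history = [f(theta)]
    for _ in range(steps):
        theta = theta - lr * numerical_grad(f, theta)
        history.append(f(theta))
    return theta, history

# Synthetic paired features standing in for image and text encoders.
data_rng = np.random.default_rng(42)
x_v = data_rng.standard_normal((8, 5))
x_t = data_rng.standard_normal((8, 4))
theta, history = train(x_v, x_t)
print("meta-objective: start", history[0], "end", history[-1])
```

Because the intra-modal loss is scored at the virtually updated parameters, its gradient penalizes cross-modal steps that would distort single-modal neighborhoods, which is one way to read "optimized in a coordinated way". A practical version would replace the finite-difference gradient with automatic differentiation.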
