CV | Oct 16, 2024

CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training

Zhiyuan Ma, Jianjun Li, Guohui Li, Kaiyan Huang

arXiv:2410.12595v110.59 citations | h-index: 17 | MM

Originality: Incremental advance

AI Analysis

This work addresses the need for more efficient and effective vision-language pre-training methods, particularly for applications in social media and multimodal AI, though it appears incremental as it builds on existing VLP approaches.

The paper tackles the limitations of cross-modal contrastive learning in vision-language pre-training by proposing CMAL, a framework that uses anchor points and associative learning to improve performance with less data. It achieves competitive results on downstream tasks and new state-of-the-art results on SNLI-VE and REC (testA).

With the flourishing of social media platforms, vision-language pre-training (VLP) has recently received great attention, and remarkable progress has been achieved. The success of VLP largely stems from the way different modalities complement and enhance each other's information. However, most recent studies focus on cross-modal contrastive learning (CMCL), which promotes image-text alignment by pulling the embeddings of positive sample pairs together while pushing those of negative pairs apart. This approach ignores the natural asymmetry between modalities and requires a large-scale image-text corpus to make hard-won progress. To mitigate this, we propose CMAL, a Cross-Modal Associative Learning framework with anchor point detection and cross-modal associative learning for VLP. Specifically, we first embed visual objects and textual tokens into separate hypersphere spaces to learn intra-modal hidden features, and then design a cross-modal associative prompt layer that performs anchor point masking and swap feature filling to construct a hybrid cross-modal associative prompt. Afterwards, we use a unified semantic encoder to learn cross-modal interactive features for context adaptation. Finally, we design an associative mapping classification layer to learn potential associative mappings between modalities at anchor points, within which we develop a new self-supervised associative mapping classification task to boost CMAL's performance. Experimental results verify the effectiveness of CMAL, showing that it achieves competitive performance against previous CMCL-based methods on four common downstream vision-and-language tasks while using a significantly smaller corpus. In particular, CMAL obtains new state-of-the-art results on SNLI-VE and REC (testA).
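
The abstract names "anchor point masking and swap feature filling" but gives no procedure, so the following is only one plausible reading, not the authors' implementation. It assumes that anchor points arrive as (visual index, text index) pairs referring to the same concept, that both modalities share a feature dimension, and that the hybrid prompt is a simple concatenation. The function name `build_swapped_prompt` and all parameter names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def build_swapped_prompt(
    visual_feats: Sequence[Vector],
    text_feats: Sequence[Vector],
    anchors: Sequence[Tuple[int, int]],
) -> Tuple[List[Vector], List[int]]:
    """Illustrative hybrid prompt: swap features at anchor points.

    `anchors` holds (visual_index, text_index) pairs that are assumed
    to describe the same concept. Returns the concatenated sequence
    (visual positions first, then text positions) and a 0/1 list that
    marks which positions were overwritten.
    """
    # Copy the inputs so the caller's features are left untouched.
    visual = [list(v) for v in visual_feats]
    text = [list(t) for t in text_feats]
    swapped_visual, swapped_text = set(), set()

    for vi, ti in anchors:
        if len(visual_feats[vi]) != len(text_feats[ti]):
            raise ValueError("anchor features must share a dimension")
        # "Mask" each anchor position, then fill it with the feature
        # of its counterpart from the other modality.
        visual[vi] = list(text_feats[ti])
        text[ti] = list(visual_feats[vi])
        swapped_visual.add(vi)
        swapped_text.add(ti)

    prompt = visual + text
    is_anchor = [int(i in swapped_visual) for i in range(len(visual))]
    is_anchor += [int(j in swapped_text) for j in range(len(text))]
    return prompt, is_anchor


# Toy example: visual object 1 and text token 2 form one anchor pair.
visual = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
prompt, is_anchor = build_swapped_prompt(visual, text, [(1, 2)])
print(prompt)     # [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.0, 1.0]]
print(is_anchor)  # [0, 1, 0, 0, 1]
```

Under this reading, the marker list could serve as the target positions for a classification task that predicts the original cross-modal counterpart at each anchor, though the abstract does not specify how the associative mapping classification layer is supervised.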
