CV | May 20, 2021

More Than Just Attention: Improving Cross-Modal Attentions with Contrastive Constraints for Image-Text Matching

arXiv:2105.09597v3 | 12 citations
Originality: Incremental advance
AI Analysis

This work addresses the limitation of inaccurate attention models in cross-modal matching, which is important for applications like image search and captioning, but it is incremental as it builds on existing attention-based methods.

The authors tackled the problem of sub-optimal cross-modal attention in image-text matching by proposing two contrastive training constraints (CCR and CCS) that provide direct supervision without explicit annotations, resulting in improved retrieval performance and attention metrics on Flickr30k and MS-COCO datasets.

Cross-modal attention mechanisms have been widely applied to the image-text matching task and have achieved remarkable improvements thanks to their capability of learning fine-grained relevance across different modalities. However, the cross-modal attention models of existing methods can be sub-optimal and inaccurate because no direct supervision is provided during training. In this work, we propose two novel training strategies, the Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints, to address these limitations. These constraints supervise the training of cross-modal attention models in a contrastive learning manner without requiring explicit attention annotations. They are plug-in training strategies and can be easily integrated into existing cross-modal attention models. Additionally, we introduce three metrics, Attention Precision, Attention Recall, and Attention F1-Score, to quantitatively measure the quality of learned attention models. We evaluate the proposed constraints by incorporating them into four state-of-the-art cross-modal attention-based image-text matching models. Experimental results on both the Flickr30k and MS-COCO datasets demonstrate that integrating these constraints improves both retrieval performance and the attention metrics.
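The abstract names Attention Precision, Recall, and F1-Score but does not define them here. As a rough illustration only, the sketch below assumes the conventional set-based definitions: a region counts as "attended" when its attention weight reaches a threshold, and the attended set is compared against a set of regions known to be relevant to a word. The function names, the threshold rule, and the example values are hypothetical and may differ from the paper's actual formulation.

```python
from typing import Sequence, Set, Tuple


def attended_regions(weights: Sequence[float], threshold: float) -> Set[int]:
    """Indices of image regions whose attention weight is >= threshold.

    Assumption: thresholding is one plausible way to turn soft attention
    weights into a discrete set; the paper may use a different rule.
    """
    return {i for i, w in enumerate(weights) if w >= threshold}


def attention_prf(attended: Set[int], relevant: Set[int]) -> Tuple[float, float, float]:
    """Standard set-based precision, recall, and F1 (assumed definitions)."""
    true_positives = len(attended & relevant)
    precision = true_positives / len(attended) if attended else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical attention weights of one word over six image regions.
weights = [0.30, 0.25, 0.20, 0.15, 0.06, 0.04]
attended = attended_regions(weights, threshold=0.15)  # regions 0, 1, 2, 3
relevant = {1, 3, 5}                                  # hypothetical ground truth
precision, recall, f1 = attention_prf(attended, relevant)
print(precision, recall, f1)
```

In this toy case two of the four attended regions are relevant and two of the three relevant regions are attended, so precision and recall diverge, which is the kind of gap a single retrieval score would not reveal.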
