CV · CL · LG · ML · Jun 17, 2020

Contrastive Learning for Weakly Supervised Phrase Grounding

arXiv:2006.09920v3 · 158 citations
Originality: Incremental advance
AI Analysis

This work addresses phrase grounding for vision-language tasks, offering a sizable but incremental improvement over existing weakly supervised methods.

The paper tackles weakly supervised phrase grounding by maximizing mutual information between images and caption words using contrastive learning, achieving a 5.7% gain to reach 76.7% accuracy on Flickr30K Entities.

Phrase grounding, the problem of associating image regions with caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on mutual information between images and caption words. Given pairs of images and captions, we maximize compatibility of the attention-weighted regions and the words in the corresponding caption, compared to non-corresponding pairs of images and captions. A key idea is to construct effective negative captions for learning through language-model-guided word substitutions. Training with our negatives yields a $\sim10\%$ absolute gain in accuracy over randomly-sampled negatives from the training data. Our weakly supervised phrase grounding model trained on COCO-Captions shows a healthy gain of $5.7\%$ to achieve $76.7\%$ accuracy on the Flickr30K Entities benchmark.
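The contrastive objective described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the scaled dot-product attention, the function names `attend` and `info_nce_loss`, and the toy features are all assumptions made for illustration. It shows word-region attention producing attention-weighted region features, and an InfoNCE-style loss that scores a matching image-caption pair against non-matching ones. With K negatives, log(K + 1) minus this loss is the standard InfoNCE lower bound on mutual information.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(words, regions):
    """Word-region attention (assumed scaled dot-product form).

    words:   (T, d) word features for one caption
    regions: (R, d) region features for one image
    Returns the (T, R) attention matrix and the (T, d)
    attention-weighted region feature for each word.
    """
    scores = words @ regions.T / np.sqrt(words.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn, attn @ regions

def compatibility(words, regions):
    # Mean dot product between each word and its attended region feature.
    _, attended = attend(words, regions)
    return float(np.sum(words * attended, axis=-1).mean())

def info_nce_loss(pos_score, neg_scores):
    """Negative log-probability of the positive pair among all candidates."""
    logits = np.concatenate(([pos_score], np.asarray(neg_scores, dtype=float)))
    logits = logits - logits.max()
    return float(-(logits[0] - np.log(np.exp(logits).sum())))

# Toy example: 4 regions, a 2-word caption whose words match regions 2 and 0.
regions = np.eye(4) * 5.0
words = regions[[2, 0]]
attn, _ = attend(words, regions)
grounded = attn.argmax(axis=1)  # region with the highest attention per word

# Negative caption: words that match regions less well (hypothetical features).
negative_words = np.full((2, 4), 0.5)
loss = info_nce_loss(
    compatibility(words, regions),
    [compatibility(negative_words, regions)],
)
```

At inference, grounding a word amounts to picking the region with the highest attention weight, as `grounded` does here. In the paper the negatives come from other images and from captions with language-model-guided word substitutions; the sketch stands in for the latter with a fixed hypothetical feature matrix.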

Code Implementations: 1 repo
