CV · CL · LG · Jun 30, 2022

Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations

arXiv:2206.15462v4 · 32 citations · h-index: 41
Originality: Incremental advance
AI Analysis

This paper addresses visual grounding for AI systems that interpret images and text. It is an incremental advance with strong benchmark gains.

The paper improves visual grounding in vision-language models by proposing a margin-based loss that aligns gradient-based explanations with human region annotations. It reports a state-of-the-art accuracy of 86.49% on Flickr30k, an absolute improvement of 5.38% over the best previous model trained under the same level of supervision.

We propose a margin-based loss for tuning joint vision-language models so that their gradient-based explanations are consistent with region-level annotations provided by humans for relatively smaller grounding datasets. We refer to this objective as Attention Mask Consistency (AMC) and demonstrate that it produces better visual grounding results than previous methods that rely on using vision-language models to score the outputs of object detectors. In particular, a model trained with AMC on top of standard vision-language modeling objectives obtains a state-of-the-art accuracy of 86.49% on the Flickr30k visual grounding benchmark, an absolute improvement of 5.38% when compared to the best previous model trained under the same level of supervision. Our approach also performs exceedingly well on established benchmarks for referring expression comprehension, where it obtains 80.34% accuracy in the easy test of RefCOCO+ and 64.55% in the difficult split. AMC is effective, easy to implement, and general: it can be adopted by any vision-language model and can use any type of region annotations.
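The abstract does not spell out the AMC objective, so the following is only a minimal sketch of what a margin-based consistency penalty between an explanation heatmap and a region mask could look like. The function name `margin_consistency_loss`, the choice to compare peak scores inside and outside the region, and the `margin` value are all assumptions made for illustration. The heatmap is assumed to come from some gradient-based explanation method computed elsewhere.

```python
from typing import List

Grid = List[List[float]]


def margin_consistency_loss(heatmap: Grid, mask: Grid, margin: float = 0.25) -> float:
    """Hypothetical hinge penalty (not the paper's exact formulation).

    Returns zero once the peak explanation score inside the annotated region
    exceeds the peak score outside it by at least `margin`.
    """
    inside = [h for h_row, m_row in zip(heatmap, mask)
              for h, m in zip(h_row, m_row) if m > 0.5]
    outside = [h for h_row, m_row in zip(heatmap, mask)
               for h, m in zip(h_row, m_row) if m <= 0.5]
    if not inside or not outside:
        # Nothing to compare when the mask covers all or none of the grid.
        return 0.0
    return max(0.0, max(outside) - max(inside) + margin)


# Toy 2x2 explanation map; values are illustrative, not real model output.
heatmap = [[0.25, 0.5],
           [1.0, 0.25]]

# Annotation covering the cell where the explanation peaks: no penalty.
loss_good = margin_consistency_loss(heatmap, [[0, 0], [1, 0]])

# Annotation covering a low-scoring cell: the hinge is active.
loss_bad = margin_consistency_loss(heatmap, [[1, 0], [0, 0]])
```

In a real training setup the heatmap would need to be differentiable with respect to the model parameters, so that this penalty could be added to the standard vision-language objectives. The pure-Python version here only illustrates the shape of the comparison.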

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
