CL, CV | Sep 6, 2023

A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models

arXiv:2309.02691v3 | 2 citations | h-index: 36 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a key problem for researchers and developers in vision-language AI by highlighting and mitigating grounding inconsistencies, though it is incremental in nature.

The study tackled the inconsistency between phrase grounding and task performance in vision-language models by proposing a joint evaluation framework and three benchmarks, showing that brute-force training on grounding annotations can address this issue.

Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions. However, observing this grounding in contemporary models is complex, even if it is generally expected to take place if the task is addressed in a way that is conducive to generalization. We propose a framework to jointly study task performance and phrase grounding, and propose three benchmarks to study the relation between the two. Our results show that contemporary models demonstrate inconsistency between their ability to ground phrases and solve tasks. We show how this can be addressed through brute-force training on phrase grounding annotations, and analyze the dynamics it creates. Code is available at https://github.com/lil-lab/phrase_grounding.

Code Implementations: 1 repo
