CL · CV · May 4, 2020

Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions

arXiv:2005.01655v11019 citations · Has Code
AI Analysis

This work exposes a critical flaw in a standard benchmark for visual referring expression recognition. It is relevant to researchers and practitioners in computer vision and natural language processing, though its contribution is incremental: it improves evaluation methods rather than proposing a new paradigm.

The paper shows that the RefCOCOg benchmark for visual referring expression recognition is flawed: 83.7% of test instances do not require reasoning on linguistic structure. Existing methods fail to exploit this structure and perform 12% to 23% lower than the established progress for the task when evaluated on data that requires such reasoning.

Visual referring expression recognition is a challenging task that requires natural language understanding in the context of an image. We critically examine RefCOCOg, a standard benchmark for this task, using a human study and show that 83.7% of test instances do not require reasoning on linguistic structure, i.e., words are enough to identify the target object and the word order doesn't matter. To measure the true progress of existing models, we split the test set into two sets, one which requires reasoning on linguistic structure and the other which doesn't. Additionally, we create an out-of-distribution dataset Ref-Adv by asking crowdworkers to perturb in-domain examples such that the target object changes. Using these datasets, we empirically show that existing methods fail to exploit linguistic structure and are 12% to 23% lower in performance than the established progress for this task. We also propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT, the current state-of-the-art model for this task. Our datasets are publicly available at https://github.com/aws/aws-refcocog-adv
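
The split-based evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the record keys `needs_structure` and `correct`, the helper names, and the toy data are all hypothetical, chosen only to show how accuracy on the two test subsets and the gap between them might be computed, and why two expressions with the same words in a different order can refer to different objects.

```python
from collections import defaultdict


def same_bag_of_words(expr_a, expr_b):
    """Return True if two expressions use the same words, ignoring order."""
    return sorted(expr_a.lower().split()) == sorted(expr_b.lower().split())


def accuracy_by_split(records):
    """Compute accuracy separately for each test subset.

    `records` is an iterable of dicts with two hypothetical keys:
    'needs_structure' (bool): the instance requires reasoning on word order
    'correct' (bool): the model picked the right target object
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        key = "structure" if record["needs_structure"] else "words_only"
        totals[key] += 1
        hits[key] += int(record["correct"])
    return {key: hits[key] / totals[key] for key in totals}


# Same words, different order: the two expressions name different targets.
order_matters = same_bag_of_words(
    "the dog left of the cat", "the cat left of the dog"
)

# Toy predictions (made up for illustration, not results from the paper).
toy_records = [
    {"needs_structure": False, "correct": True},
    {"needs_structure": False, "correct": True},
    {"needs_structure": False, "correct": False},
    {"needs_structure": False, "correct": True},
    {"needs_structure": True, "correct": True},
    {"needs_structure": True, "correct": False},
]

scores = accuracy_by_split(toy_records)
gap = scores["words_only"] - scores["structure"]
```

Reporting the two accuracies separately, rather than one pooled number, is what lets a drop on the structure-dependent subset show up at all when that subset is a small share of the test set.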

Code Implementations: 1 repo
