CV · CL · Aug 2, 2023

Grounded Image Text Matching with Mismatched Relation Reasoning

arXiv:2308.01236v2 · 14 citations · h-index: 37
Originality: Incremental advance
AI Analysis

The paper addresses a specific challenge for researchers in visual-linguistic AI, but the advance is incremental because it builds on existing transformer-based models and tasks.

This paper tackles the problem of evaluating relation understanding in visual-linguistic models. It introduces the GITM-MR task, which requires a model to determine whether a text describes an image and to localize the mismatched parts, and it finds that pre-trained models lack data efficiency and length generalization ability.

This paper introduces Grounded Image Text Matching with Mismatched Relation (GITM-MR), a novel visual-linguistic joint task that evaluates the relation understanding capabilities of transformer-based pre-trained models. GITM-MR requires a model to first determine if an expression describes an image, then localize referred objects or ground the mismatched parts of the text. We provide a benchmark for evaluating pre-trained models on this task, with a focus on the challenging settings of limited data and out-of-distribution sentence lengths. Our evaluation demonstrates that pre-trained models lack data efficiency and length generalization ability. To address this, we propose the Relation-sensitive Correspondence Reasoning Network (RCRN), which incorporates relation-aware reasoning via bi-directional message propagation guided by language structure. RCRN can be interpreted as a modular program and delivers strong performance in both length generalization and data efficiency.
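The two-stage output described above (decide whether the expression matches, then either localize the referred objects or ground the mismatched text) can be pictured with a small sketch. Everything here is an illustrative assumption rather than the paper's code or data format: the names `TaskOutput`, `iou`, and `is_correct`, the `(x1, y1, x2, y2)` box layout, the token-span representation of the mismatched part, and the 0.5 IoU threshold are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumed box layout: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class TaskOutput:
    """Hypothetical container for one GITM-MR style prediction or label."""
    matched: bool                                       # does the text describe the image?
    boxes: List[Box] = field(default_factory=list)      # referred objects, if matched
    mismatch_span: Optional[Tuple[int, int]] = None     # token span [start, end), if not matched


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred: TaskOutput, gold: TaskOutput, iou_threshold: float = 0.5) -> bool:
    """Assumed scoring rule: the match decision must agree, then the grounding must agree."""
    if pred.matched != gold.matched:
        return False
    if gold.matched:
        # Matched case: every predicted box must overlap its gold box enough.
        return len(pred.boxes) == len(gold.boxes) and all(
            iou(p, g) >= iou_threshold for p, g in zip(pred.boxes, gold.boxes)
        )
    # Mismatched case: the predicted text span must equal the gold span.
    return pred.mismatch_span == gold.mismatch_span
```

Under these assumptions, a prediction with the right match decision but the wrong mismatched span counts as incorrect, which reflects the idea that the task probes relation understanding and not only image-text matching.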

