CL · AI · CV · Apr 18, 2018

Object Ordering with Bidirectional Matchings for Visual Reasoning

arXiv:1804.06870v21094 citations
Originality: Incremental advance
AI Analysis

The paper targets higher accuracy on visual reasoning tasks and represents an incremental advance, with reported gains of 4-6% absolute over the prior state of the art.

The paper tackles visual reasoning with compositional natural language instructions on the NLVR dataset. It proposes an end-to-end neural model that uses joint bidirectional attention to condition phrases and objects on each other, and an RL-based pointer network to order the objects so that they match the statement phrases. The model achieves 4-6% absolute improvements over the state of the art on both the structured-representation and raw-image versions of the dataset.

Visual reasoning with compositional natural language instructions, e.g., based on the newly-released Cornell Natural Language Visual Reasoning (NLVR) dataset, is a challenging task, where the model needs to have the ability to create an accurate mapping between the diverse phrases and the several objects placed in complex arrangements in the image. Further, this mapping needs to be processed to answer the question in the statement given the ordering and relationship of the objects across three similar images. In this paper, we propose a novel end-to-end neural model for the NLVR task, where we first use joint bidirectional attention to build a two-way conditioning between the visual information and the language phrases. Next, we use an RL-based pointer network to sort and process the varying number of unordered objects (so as to match the order of the statement phrases) in each of the three images and then pool over the three decisions. Our model achieves strong improvements (of 4-6% absolute) over the state-of-the-art on both the structured representation and raw image versions of the dataset.
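The two-way conditioning and the final pooling step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the dot-product similarity, the concatenation of original and attended features, the max-pooling choice, and the function names `bidirectional_attention` and `pool_decisions` are all assumptions made for illustration. The RL-based pointer network is omitted entirely.

```python
import numpy as np


def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def bidirectional_attention(phrases, objects):
    """Condition phrases on objects and objects on phrases.

    phrases: array of shape (P, d), one row per phrase embedding.
    objects: array of shape (O, d), one row per object embedding.
    Returns (phrases_aug, objects_aug) with shapes (P, 2d) and (O, 2d).
    """
    # Assumption: plain dot-product similarity between every phrase/object pair.
    scores = phrases @ objects.T                      # (P, O)

    # Each phrase attends over all objects (rows sum to 1).
    phrase_to_object = softmax(scores, axis=1)        # (P, O)
    # Each object attends over all phrases (columns sum to 1).
    object_to_phrase = softmax(scores, axis=0)        # (P, O)

    phrases_ctx = phrase_to_object @ objects          # (P, d)
    objects_ctx = object_to_phrase.T @ phrases        # (O, d)

    # Assumption: concatenate original and attended features.
    phrases_aug = np.concatenate([phrases, phrases_ctx], axis=1)
    objects_aug = np.concatenate([objects, objects_ctx], axis=1)
    return phrases_aug, objects_aug


def pool_decisions(per_image_scores):
    # Assumption: max-pool the per-image decision scores into one score.
    return float(np.max(per_image_scores))
```

With 4 phrase vectors and 5 object vectors of dimension 8, `bidirectional_attention` returns arrays of shape (4, 16) and (5, 16). Because the number of objects only affects the attention matrix shape, the same function handles a varying number of objects per image.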

