CV | AI | Mar 29, 2019

Relation-Aware Graph Attention Network for Visual Question Answering

arXiv:1903.12314v3 | 31.3 | 393 citations | Has Code
Originality: Highly original
AI Analysis

This paper addresses the challenge of understanding the interactive dynamics between objects in a visual scene for visual question answering. It is an incremental improvement that applies a novel method to a known bottleneck.

The paper proposes a Relation-aware Graph Attention Network (ReGAT) for answering complex questions about images. ReGAT models both explicit and implicit object relations and outperforms prior state-of-the-art approaches on the VQA 2.0 and VQA-CP v2 datasets.

To answer semantically complicated questions about an image, a Visual Question Answering (VQA) model needs to fully understand the visual scene in the image, especially the interactive dynamics between different objects. We propose a Relation-aware Graph Attention Network (ReGAT), which encodes each image into a graph and models multi-type inter-object relations via a graph attention mechanism, to learn question-adaptive relation representations. Two types of visual object relations are explored: (i) Explicit Relations that represent geometric positions and semantic interactions between objects; and (ii) Implicit Relations that capture the hidden dynamics between image regions. Experiments demonstrate that ReGAT outperforms prior state-of-the-art approaches on both VQA 2.0 and VQA-CP v2 datasets. We further show that ReGAT is compatible with existing VQA architectures, and can be used as a generic relation encoder to boost model performance for VQA.
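
As a rough illustration of the idea in the abstract, the sketch below runs one round of question-conditioned attention over a small object graph. It is not the authors' implementation: the function name `relation_aware_attention`, the use of plain dot-product scores with no learned projections, and the scalar `edge_bias` standing in for a learned relation-label term are all assumptions made for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a non-empty list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def relation_aware_attention(objects, question, edge_bias):
    """One round of question-conditioned attention over an object graph.

    objects:   list of N feature vectors, one per image region
    question:  question embedding, concatenated onto every object feature
    edge_bias: N x N matrix; edge_bias[i][j] is a scalar stand-in for a
               relation-specific term (hypothetical simplification), and
               None means there is no edge from i to j
    Returns (updated_features, attention_weights).
    """
    # Question-adaptive node features: [object ; question]
    nodes = [list(obj) + list(question) for obj in objects]
    scale = math.sqrt(len(nodes[0]))
    dim = len(objects[0])
    updated, weights = [], []
    for i, v_i in enumerate(nodes):
        neighbours = [j for j in range(len(nodes)) if edge_bias[i][j] is not None]
        if not neighbours:
            # Isolated node: keep its own feature unchanged
            updated.append(list(objects[i]))
            weights.append({})
            continue
        # Scaled dot-product score plus the relation bias for each edge
        scores = [dot(v_i, nodes[j]) / scale + edge_bias[i][j] for j in neighbours]
        alpha = softmax(scores)
        # Attention-weighted sum of neighbour object features
        agg = [sum(a * objects[j][k] for a, j in zip(alpha, neighbours))
               for k in range(dim)]
        updated.append(agg)
        weights.append(dict(zip(neighbours, alpha)))
    return updated, weights


# Toy example: three regions with 2-d features and a 2-d question embedding
objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question = [0.5, 0.5]
edge_bias = [[0.0, 0.0, 1.0],
             [0.0, 0.0, None],
             [1.0, None, 0.0]]
updated, weights = relation_aware_attention(objects, question, edge_bias)
```

In this toy setup, a `None` entry removes an edge (as a pruned explicit-relation graph might), while a fully connected graph with zero biases would correspond loosely to the implicit-relation case, where attention alone decides which regions interact.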

Code Implementations: 1 repo
