CV, AI, CL, LG | Aug 1, 2018

Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining

arXiv:1808.00265v1 | 72 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of interpretability in VQA while reducing reliance on costly annotations, which is relevant to both researchers and practitioners. It is an incremental advance, as it builds on existing grounding methods.

The paper tackles the problem of training interpretable visual question answering (VQA) models without expensive human-annotated visual grounding data. It automatically mines supervision from available region descriptions and object annotations. The resulting model achieves state-of-the-art VQA accuracy and a higher correlation with manually annotated groundings.

A key aspect of interpretable VQA models is their ability to ground their answers to relevant regions in the image. Current approaches with this capability rely on supervised learning and human-annotated groundings to train attention mechanisms inside the VQA architecture. Unfortunately, obtaining human annotations specific to visual grounding is difficult and expensive. In this work, we demonstrate that we can effectively train a VQA architecture with grounding supervision that can be automatically obtained from available region descriptions and object annotations. We also show that our model trained with this mined supervision generates visual groundings that achieve a higher correlation with manually annotated groundings, while also achieving state-of-the-art VQA accuracy.
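The idea of mining attention supervision from existing annotations can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `mine_grounding` and `attention_kl` are hypothetical, the rule that matches an object label to a word in the question or answer is a simplifying assumption, and the KL-divergence penalty is one plausible choice of attention loss that the abstract does not specify. Boxes use normalized coordinates in [0, 1], and the attention map is a flat list over a square grid of image cells.

```python
import math


def mine_grounding(question, answer, annotations, grid=4):
    """Build a target attention distribution over a grid x grid layout of cells.

    annotations: list of (label, (x0, y0, x1, y1)) with coordinates in [0, 1].
    A box counts as relevant when its label appears as a word in the question
    or answer (an assumption for illustration, not the paper's matching rule).
    Returns a flat list of grid * grid probabilities, or None if nothing matched.
    """
    words = set((question + " " + answer).lower().replace("?", " ").split())
    scores = [0.0] * (grid * grid)
    for label, (x0, y0, x1, y1) in annotations:
        if label.lower() not in words:
            continue
        for row in range(grid):
            for col in range(grid):
                # Bounds of the current grid cell.
                cx0, cy0 = col / grid, row / grid
                cx1, cy1 = (col + 1) / grid, (row + 1) / grid
                # Overlap area between the annotated box and the cell.
                w = max(0.0, min(x1, cx1) - max(x0, cx0))
                h = max(0.0, min(y1, cy1) - max(y0, cy0))
                scores[row * grid + col] += w * h
    total = sum(scores)
    if total == 0.0:
        return None  # no mined supervision for this example
    return [s / total for s in scores]


def attention_kl(target, predicted, eps=1e-8):
    """KL(target || predicted) between two attention distributions."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0.0
    )


annotations = [("dog", (0.0, 0.0, 0.5, 0.5)), ("tree", (0.5, 0.5, 1.0, 1.0))]
target = mine_grounding("What color is the dog?", "brown", annotations, grid=2)
uniform = [0.25] * 4
loss = attention_kl(target, uniform)
```

In a training loop, a term like `loss` would be added to the usual answer-classification loss only for examples where `mine_grounding` returns a target, so examples without matching annotations fall back to ordinary VQA training.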

