CV · Apr 11, 2017

Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering

arXiv:1704.03162v2 · 196 citations
Originality: Incremental advance
AI Analysis

This work provides a strong baseline for visual question answering, potentially guiding more meaningful research in the field, though it is incremental as it builds on similar existing models.

The paper tackles visual question answering with a simple model that sets a new state of the art on VQA benchmarks: 64.6% accuracy on the VQA 1.0 test-standard set and 59.7% on the VQA 2.0 validation set, improvements of 0.4 and 0.5 percentage points over the previous best results, respectively.

This paper presents a new baseline for the visual question answering task. Given an image and a question in natural language, our model produces accurate answers according to the content of the image. Our model, while being architecturally simple and relatively small in terms of trainable parameters, sets a new state of the art on both unbalanced and balanced VQA benchmarks. On the VQA 1.0 open-ended challenge, our model achieves 64.6% accuracy on the test-standard set without using additional data, an improvement of 0.4% over the state of the art, and on the newly released VQA 2.0, our model scores 59.7% on the validation set, outperforming the best previously reported results by 0.5%. The results presented in this paper are especially interesting because very similar models have been tried before, but significantly lower performance was reported. In light of the new results, we hope to see more meaningful research on visual question answering in the future.

Code Implementations: 13 repos

Data from Papers with Code (CC-BY-SA-4.0)

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
