CL, AI, CV | Apr 23, 2020

Visual Question Answering Using Semantic Information from Image Descriptions

arXiv:2004.10966v2
AI Analysis

This work addresses the challenge of improving accuracy and reducing training data requirements in visual question answering, an incremental advance in image understanding for AI systems.

The authors propose a deep neural architecture for visual question answering that combines region-based image features, the natural language question, and semantic knowledge from image descriptions to generate open-ended answers. They report excellent results against a strong baseline.

In this work, we propose a deep neural architecture with an attention mechanism that uses region-based image features, the natural language question asked, and semantic knowledge extracted from the regions of an image to produce open-ended answers to questions in a visual question answering (VQA) task. Combining region-based features with region-based textual information about the image helps a model respond to questions more accurately, and potentially with less training data. We evaluate our proposed architecture on a VQA task against a strong baseline and show that our method achieves excellent results on this task.
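
The abstract does not spell out the attention computation, so the following is only a minimal sketch of one common way question-guided attention over regions can be set up, not the authors' architecture. It assumes each region has a visual feature vector and an embedding of its text description, that the two are fused by concatenation, and that scores come from a single tanh layer. The function name attend_regions, the weight matrices W_r, W_q, w_a, and all dimensions are hypothetical and chosen for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / np.sum(exp)


def attend_regions(region_feats, region_text, question, W_r, W_q, w_a):
    """Question-guided attention over image regions (illustrative only).

    region_feats: (K, Dv) visual features, one row per region
    region_text:  (K, Dt) embeddings of each region's text description
    question:     (Dq,)   question embedding
    W_r: (Dv + Dt, H), W_q: (Dq, H), w_a: (H,)  hypothetical parameters
    """
    # Assumption: visual and textual region information is fused by concatenation.
    fused = np.concatenate([region_feats, region_text], axis=1)  # (K, Dv + Dt)
    # Project regions and question into a shared space; the question
    # projection is broadcast across all K regions.
    hidden = np.tanh(fused @ W_r + question @ W_q)               # (K, H)
    scores = hidden @ w_a                                        # (K,)
    weights = softmax(scores)                                    # (K,), sums to 1
    # Attention-weighted summary of the fused region representations.
    attended = weights @ fused                                   # (Dv + Dt,)
    return weights, attended


# Random stand-ins for real features; sizes are arbitrary.
rng = np.random.default_rng(0)
K, Dv, Dt, Dq, H = 5, 8, 6, 4, 16
region_feats = rng.normal(size=(K, Dv))
region_text = rng.normal(size=(K, Dt))
question = rng.normal(size=Dq)
W_r = rng.normal(size=(Dv + Dt, H))
W_q = rng.normal(size=(Dq, H))
w_a = rng.normal(size=H)

weights, attended = attend_regions(region_feats, region_text, question, W_r, W_q, w_a)
```

In a full model, the attended vector would typically be combined with the question embedding and passed to an answer classifier or decoder; that stage is omitted here because the abstract gives no detail on it.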

