Where To Look: Focus Regions for Visual Question Answering
This paper addresses the problem of improving accuracy in visual question answering for AI systems, though the contribution appears incremental because it builds on existing region selection methods.
The paper tackles visual question answering by learning to select relevant image regions for text-based queries, achieving significant improvements on specific question types like 'what color' and 'what room'.
We present a method that learns to answer visual questions by selecting image regions relevant to the text-based query. Our method exhibits significant improvements in answering questions such as "what color," where it is necessary to evaluate a specific location, and "what room," where it selectively identifies informative image regions. We test our model on the VQA dataset, which is, to our knowledge, the largest human-annotated visual question answering dataset.
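To make the idea of query-driven region selection concrete, the following is a minimal sketch of soft attention over image regions. It is an illustration under stated assumptions, not the paper's implementation: the function name `attend_regions`, the use of a plain dot product as the relevance score, and the assumption that the query and region features already live in a shared embedding space are all hypothetical choices made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend_regions(query_vec, region_feats):
    """Weight image regions by their relevance to a text query.

    Assumption: query_vec and every row of region_feats are already
    embedded in the same space, so a dot product is a usable score.
    Returns (weights, pooled), where weights sum to 1 and pooled is
    the weighted average of the region features.
    """
    scores = [dot(query_vec, region) for region in region_feats]
    weights = softmax(scores)
    dim = len(region_feats[0])
    pooled = [
        sum(w * region[i] for w, region in zip(weights, region_feats))
        for i in range(dim)
    ]
    return weights, pooled


# Toy example: three regions with 2-D features and one query vector.
query = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
weights, pooled = attend_regions(query, regions)
```

In this toy example the first region aligns with the query and so receives the largest weight, which mirrors the intuition that a "what color" question should concentrate on one specific location, while a "what room" question may spread weight over several informative regions.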