The Color of the Cat is Gray: 1 Million Full-Sentences Visual Question Answering (FSVQA)
The work addresses the limited linguistic richness of answers in VQA, though the contribution is incremental: it builds on existing data using rule-based processing.
The authors tackled the limitation of short, repetitive answers in Visual Question Answering (VQA) by introducing the Full-Sentence Visual Question Answering (FSVQA) dataset, which contains nearly 1 million question-answer pairs with full-sentence answers derived from existing datasets.
The Visual Question Answering (VQA) task has showcased a new stage of interaction between language and vision, two of the most pivotal components of artificial intelligence. However, it has mostly focused on generating short and repetitive answers, mostly single words, which fall short of the rich linguistic capabilities of humans. We introduce the Full-Sentence Visual Question Answering (FSVQA) dataset, consisting of nearly 1 million pairs of questions and full-sentence answers for images, built by applying a number of rule-based natural language processing techniques to the original VQA dataset and captions in the MS COCO dataset. This poses many additional complexities to the conventional VQA task, and we provide a baseline for approaching and evaluating the task, on top of which we invite the research community to build further improvements.
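To make the idea of rule-based conversion concrete, the following is a minimal sketch of how a question and a short answer might be turned into a full-sentence answer. The abstract does not list the actual rules, so the function name `to_full_sentence`, the two patterns, and the fallback are illustrative assumptions only, not the authors' pipeline.

```python
import re


def to_full_sentence(question: str, short_answer: str) -> str:
    """Toy rule-based conversion (hypothetical rules, not the FSVQA rules)."""
    q = question.strip().rstrip("?").strip()
    a = short_answer.strip()

    # Assumed rule 1: "What color is/are X" -> "The color of X is A."
    m = re.fullmatch(r"what colou?r (?:is|are) (.+)", q, flags=re.IGNORECASE)
    if m:
        return f"The color of {m.group(1)} is {a}."

    # Assumed rule 2: "How many N are REST" -> "There are A N REST."
    m = re.fullmatch(r"how many (\w+) are (.+)", q, flags=re.IGNORECASE)
    if m:
        return f"There are {a} {m.group(1)} {m.group(2)}."

    # Fallback: no rule matched, so return the short answer as a sentence.
    return a[:1].upper() + a[1:] + "."


print(to_full_sentence("What color is the cat?", "gray"))
print(to_full_sentence("How many people are in the picture?", "3"))
```

A real pipeline would likely need part-of-speech tagging or parsing to handle tense, number agreement, and yes/no questions, which simple patterns like these do not cover.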