Why Does a Visual Question Have Different Answers?
This work addresses a fundamental challenge in visual question answering by explaining why answers vary. The contribution is incremental in that it builds on existing VQA tasks.
The paper asks why visual questions often receive different answers from different people. It proposes a taxonomy of nine reasons and creates two labeled datasets of ~45,000 visual questions, annotated with the reasons that led to answer differences. It also introduces a novel algorithm that predicts, directly from a visual question, which reasons will cause answer differences, and reports experimental advantages over several related baselines on both datasets.
Visual question answering is the task of returning the answer to a question about an image. A challenge is that different people often provide different answers to the same visual question. To our knowledge, this is the first work that aims to understand why. We propose a taxonomy of nine plausible reasons, and create two labelled datasets consisting of ~45,000 visual questions indicating which reasons led to answer differences. We then propose a novel problem of predicting directly from a visual question which reasons will cause answer differences as well as a novel algorithm for this purpose. Experiments demonstrate the advantage of our approach over several related baselines on two diverse datasets. We publicly share the datasets and code at https://vizwiz.org.
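One way to read the prediction problem described above is as multi-label classification: a visual question can have several reasons for answer differences at once, so each reason gets its own yes/no decision. The sketch below illustrates only that framing and is not the paper's algorithm. The reason names (`reason_1` to `reason_9`), the `predict_reasons` helper, the 0.5 threshold, and the example scores are all hypothetical placeholders, since the taxonomy's nine reasons are not listed here.

```python
from typing import Dict, List

# Hypothetical placeholder labels: the nine reasons of the taxonomy
# are not enumerated in this text, so generic names are used instead.
REASONS: List[str] = [f"reason_{i}" for i in range(1, 10)]


def predict_reasons(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every reason whose score meets the threshold.

    Each reason is decided independently, so zero, one, or several
    reasons may be predicted for the same visual question.
    """
    return [r for r in REASONS if scores.get(r, 0.0) >= threshold]


# Made-up per-reason scores, standing in for the output of some model.
example_scores = {"reason_2": 0.81, "reason_5": 0.40, "reason_7": 0.66}
predicted = predict_reasons(example_scores)
print(predicted)
```

In this framing, a model would supply one score per reason for each image-question pair, and the threshold would be tuned on held-out data.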