Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions
This work addresses ambiguous user queries in VQA systems. The contribution is incremental: it builds on existing research but focuses on interactive clarification rather than question rephrasing.
The paper tackles ambiguous questions in visual question answering (VQA) by introducing the ClearVQA benchmark to assess and improve vision-language models' ability to resolve ambiguities through interaction. It targets two gaps: the lack of suitable benchmarks, and models' tendency to answer rather than ask for clarification.
In the visual question answering (VQA) context, users often pose ambiguous questions to visual language models (VLMs) due to varying expression habits. Existing research addresses such ambiguities primarily by rephrasing questions. These approaches neglect the inherently interactive nature of user interactions with VLMs, where ambiguities can be clarified through user feedback. However, research on interactive clarification faces two major challenges: (1) there are no benchmarks to assess VLMs' capacity for resolving ambiguities through interaction; (2) VLMs are trained to prefer answering rather than asking, which prevents them from seeking clarification. To overcome these challenges, we introduce the \textbf{ClearVQA} benchmark, which targets three common categories of ambiguity in the VQA context and encompasses various VQA scenarios.