CV | AI | LG | Nov 1, 2024

Right this way: Can VLMs Guide Us to See More to Answer Questions?

arXiv:2411.00394v1 | 14 citations | h-index: 8 | NIPS
Originality: Incremental advance
AI Analysis

This paper addresses a gap in VLM capabilities for assisting visually impaired individuals by enabling models to indicate when and how an image should be adjusted for better question answering.

The paper tackles the problem of Vision Language Models (VLMs) giving forced answers in Visual Question Answering (VQA) without first assessing whether the visual information is sufficient, and shows that fine-tuning VLMs on synthetic data significantly improves their ability to guide image adjustments.

In question-answering scenarios, humans can assess whether the available information is sufficient and seek additional information if necessary, rather than providing a forced answer. In contrast, Vision Language Models (VLMs) typically generate direct, one-shot responses without evaluating the sufficiency of the information. To investigate this gap, we identify a critical and challenging task in the Visual Question Answering (VQA) scenario: can VLMs indicate how to adjust an image when the visual information is insufficient to answer a question? This capability is especially valuable for assisting visually impaired individuals who often need guidance to capture images correctly. To evaluate this capability of current VLMs, we introduce a human-labeled dataset as a benchmark for this task. Additionally, we present an automated framework that generates synthetic training data by simulating "where to know" scenarios. Our empirical results show significant performance improvements in mainstream VLMs when fine-tuned with this synthetic data. This study demonstrates the potential to narrow the gap between information assessment and acquisition in VLMs, bringing their performance closer to humans.

Code Implementations: 1 repo
