CV | AI | May 20, 2025

VoQA: Visual-only Question Answering

arXiv:2505.14227v1 | 1 citation | h-index: 4 | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses a novel multimodal task aimed at improving visual understanding in AI, though the advance is incremental because it builds on existing large vision-language models.

The paper tackles Visual-only Question Answering (VoQA), in which the question is rendered inside the image and no separate text input is given. It introduces Guided Response Triggering Supervised Fine-tuning (GRT-SFT), which the authors report significantly improves model performance on this task.

We propose Visual-only Question Answering (VoQA), a novel multimodal task in which questions are visually embedded within images, without any accompanying textual input. This requires models to locate, recognize, and reason over visually embedded textual questions, posing challenges for existing large vision-language models (LVLMs), which show notable performance drops even with carefully designed prompts. To bridge this gap, we introduce Guided Response Triggering Supervised Fine-tuning (GRT-SFT), a structured fine-tuning strategy that guides the model to perform step-by-step reasoning purely based on visual input, significantly improving model performance. Our work enhances models' capacity for human-like visual understanding in complex multimodal scenarios, where information, including language, is perceived visually.
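
To make the idea of a structured, step-by-step response more concrete, here is a minimal sketch of how a supervision target for such fine-tuning might be assembled. It assumes the response first restates the question recognized in the image and then gives the answer after a delimiter. The `TRIGGER` string, the `VoQAExample` fields, and both helper functions are hypothetical names chosen for illustration; the abstract does not specify the actual format used by GRT-SFT.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical delimiter separating the recognized question from the answer.
# The abstract does not state which token or format the authors use.
TRIGGER = "<ANSWER>"


@dataclass
class VoQAExample:
    image_path: str  # image with the question rendered into its pixels
    question: str    # ground-truth text of the embedded question
    answer: str      # ground-truth answer


def build_target(example: VoQAExample) -> str:
    """Compose a two-step target: recognized question, delimiter, answer."""
    return f"{example.question.strip()} {TRIGGER} {example.answer.strip()}"


def split_response(response: str) -> Tuple[str, str]:
    """Split a model response into (recognized question, answer).

    If the delimiter is missing, the whole response is treated as the answer.
    """
    question, sep, answer = response.partition(TRIGGER)
    if not sep:
        return "", response.strip()
    return question.strip(), answer.strip()
```

Under these assumptions, only the image would be passed to the model as input, and the string returned by `build_target` would serve as the training label; `split_response` recovers the answer portion at evaluation time.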

Code Implementations: 1 repo
