CV · AI · Apr 13

What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models

arXiv:2601.06165 · 66.33 citations · h-index: 6
AI Analysis

For VLM developers and users, the study reveals that benchmark performance overestimates real-world capability due to query under-specification, a critical gap for deployment.

Vision-language models (VLMs) struggle with real-world under-specified queries; even top models like GPT-5 and Gemini 2.5 Pro achieve under 50% accuracy on such queries, while explicit rewrites yield 8–22 point improvements.

Current vision-language benchmarks predominantly feature well-structured questions with clear, explicit prompts. However, real user queries are often informal and under-specified. Users naturally leave much unsaid, relying on images to convey context. We introduce HAERAE-Vision, a benchmark of 653 real-world visual questions from Korean online communities (a 0.76% survival rate from 86K candidates), each paired with an explicit rewrite, yielding 1,306 query variants in total. Evaluating 39 VLMs, we find that even state-of-the-art models (GPT-5, Gemini 2.5 Pro) achieve under 50% on the original queries. Crucially, query explicitation alone yields 8 to 22 point improvements, with smaller models benefiting most. We further show that even with web search, under-specified queries underperform explicit queries without search, revealing that current retrieval cannot compensate for what users leave unsaid. Our findings demonstrate that a substantial portion of VLM difficulty stems from natural query under-specification rather than model capability, highlighting a critical gap between benchmark evaluation and real-world deployment.
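The "point improvement" from explicitation can be read as the difference between a model's accuracy on the explicit rewrites and its accuracy on the original queries, measured over the same questions. The sketch below illustrates that paired comparison under stated assumptions: the `PairedResult` record, the function names, and the toy data are hypothetical and are not taken from the paper or its evaluation code.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class PairedResult:
    """Hypothetical record: one model's correctness on one question,
    for the original query and for its explicit rewrite."""
    original_correct: bool
    explicit_correct: bool


def accuracy(flags: Iterable[bool]) -> float:
    """Percentage of True values; raises on an empty input."""
    values: List[bool] = list(flags)
    if not values:
        raise ValueError("no results to score")
    return 100.0 * sum(values) / len(values)


def explicitation_gain(results: List[PairedResult]) -> Tuple[float, float, float]:
    """Return (original accuracy, explicit accuracy, gain in percentage points)."""
    orig = accuracy(r.original_correct for r in results)
    expl = accuracy(r.explicit_correct for r in results)
    return orig, expl, expl - orig


# Toy data for illustration only; these are NOT results from the paper.
toy = [
    PairedResult(True, True),
    PairedResult(False, True),
    PairedResult(False, False),
    PairedResult(False, True),
]
orig, expl, gain = explicitation_gain(toy)
print(f"original={orig:.1f}%  explicit={expl:.1f}%  gain={gain:.1f} points")

# Sanity check on the benchmark size stated in the abstract:
# 653 questions, each with an original and an explicit variant.
n_questions = 653
n_variants = 2 * n_questions
print(n_variants)  # 1306
```

Because each question appears in both conditions, the gain isolates the effect of the query wording: the image, the question content, and the model are held fixed.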

