CL · CV · Mar 9, 2025

VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering

arXiv:2503.06492v1 · 4 citations · h-index: 7 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for better evaluation tools in multimodal AI to improve fact-seeking capabilities, though it is incremental as it builds on existing benchmarking efforts.

The authors address non-factual responses from large vision-language models in fact-seeking question answering by introducing VisualSimpleQA, a benchmark that enables decoupled evaluation of visual and linguistic modules. In experiments on 15 LVLMs, even state-of-the-art models such as GPT-4o achieve only 60%+ correctness on the main dataset and 30%+ on the challenging subset, VisualSimpleQA-hard.

Large vision-language models (LVLMs) have demonstrated remarkable achievements, yet the generation of non-factual responses remains prevalent in fact-seeking question answering (QA). Current multimodal fact-seeking benchmarks primarily focus on comparing model outputs to ground truth answers, providing limited insights into the performance of modality-specific modules. To bridge this gap, we introduce VisualSimpleQA, a multimodal fact-seeking benchmark with two key features. First, it enables streamlined and decoupled evaluation of LVLMs in visual and linguistic modalities. Second, it incorporates well-defined difficulty criteria to guide human annotation and facilitates the extraction of a challenging subset, VisualSimpleQA-hard. Experiments on 15 LVLMs show that even state-of-the-art models such as GPT-4o achieve merely 60%+ correctness in multimodal fact-seeking QA on VisualSimpleQA and 30%+ on VisualSimpleQA-hard. Furthermore, the decoupled evaluation across these models highlights substantial opportunities for improvement in both visual and linguistic modules. The dataset is available at https://huggingface.co/datasets/WYLing/VisualSimpleQA.
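The abstract does not spell out how the decoupled scores are computed, so the following is only a minimal sketch under stated assumptions: each question is graded twice, once with the image and once as a text-only rewrite, and the gap between the two correctness rates is read as a rough signal for the visual module. The record type, field names, and the `gap` metric are hypothetical illustrations, not the authors' implementation or the dataset's schema.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class GradedItem:
    """Hypothetical record: one benchmark question graded in two settings."""
    multimodal_correct: bool  # graded answer when given image + question
    text_only_correct: bool   # graded answer for an assumed text-only rewrite


def correctness(flags: Iterable[bool]) -> float:
    """Fraction of answers graded correct (0.0 for an empty input)."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def decoupled_report(items: list[GradedItem]) -> dict[str, float]:
    """Report correctness per setting plus the text-only minus multimodal gap."""
    multimodal = correctness(i.multimodal_correct for i in items)
    text_only = correctness(i.text_only_correct for i in items)
    # Assumption: a large positive gap suggests errors that arise on the visual
    # side, since the model answers correctly once the image is not needed.
    return {"multimodal": multimodal, "text_only": text_only, "gap": text_only - multimodal}


# Toy grades, invented purely for illustration.
items = [
    GradedItem(multimodal_correct=True, text_only_correct=True),
    GradedItem(multimodal_correct=False, text_only_correct=True),
    GradedItem(multimodal_correct=False, text_only_correct=False),
    GradedItem(multimodal_correct=True, text_only_correct=True),
]
report = decoupled_report(items)
print(report)
```

Keeping both grades on the same record makes the comparison per-question rather than per-dataset, which is what lets a failure be attributed to one modality instead of the model as a whole.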

