CV | Mar 25, 2025

VisualQuest: A Diverse Image Dataset for Evaluating Visual Recognition in LLMs

Kelaiti Xiao, Liang Yang, Paerhati Tulajiang, Hongfei Lin

arXiv:2503.19936v13.6 | h-index: 15 | PRCV

Originality: Synthesis-oriented

AI Analysis

VisualQuest gives multimodal AI researchers a new benchmark for assessing and improving LLMs on non-traditional visual tasks, though the contribution is incremental because it builds on existing evaluation frameworks.

The paper addresses the evaluation of visual recognition in large language models (LLMs) by introducing VisualQuest, a diverse dataset of stylized imagery. It finds significant performance variation among state-of-the-art models, which highlights the need for both factual knowledge and inferential capabilities.

This paper introduces VisualQuest, a novel image dataset designed to assess the ability of large language models (LLMs) to interpret non-traditional, stylized imagery. Unlike conventional photographic benchmarks, VisualQuest challenges models with images that incorporate abstract, symbolic, and metaphorical elements, requiring the integration of domain-specific knowledge and advanced reasoning. The dataset was meticulously curated through multiple stages of filtering, annotation, and standardization to ensure high quality and diversity. Our evaluations using several state-of-the-art multimodal LLMs reveal significant performance variations that underscore the importance of both factual background knowledge and inferential capabilities in visual recognition tasks. VisualQuest thus provides a robust and comprehensive benchmark for advancing research in multimodal reasoning and model architecture design.
