CV, AI, LG | Oct 14, 2025

VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage

A. Alfarano, L. Venturoli, D. Negueruela del Castillo

arXiv:2510.12750v1 | 10.23 citations | Has Code | 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

Originality: Incremental advance

AI Analysis

This work addresses the problem of evaluating deep visual reasoning in AI models, which matters to researchers and practitioners in multimodal AI. The advance is incremental, as it builds on existing VQA benchmarks.

The authors address the lack of deep semantic evaluation in Visual Question Answering (VQA) benchmarks for complex domains such as art by introducing VQArt-Bench, a large-scale benchmark for cultural heritage. Evaluating 14 state-of-the-art multimodal large language models (MLLMs) on it, they find significant limitations, including a weakness in simple counting tasks and a performance gap between proprietary and open-source models.

Multimodal Large Language Models (MLLMs) have demonstrated significant capabilities in joint visual and linguistic tasks. However, existing Visual Question Answering (VQA) benchmarks often fail to evaluate deep semantic understanding, particularly in complex domains like visual art analysis. Confined to simple syntactic structures and surface-level attributes, these questions fail to capture the diversity and depth of human visual inquiry. This limitation incentivizes models to exploit statistical shortcuts rather than engage in visual reasoning. To address this gap, we introduce VQArt-Bench, a new, large-scale VQA benchmark for the cultural heritage domain. This benchmark is constructed using a novel multi-agent pipeline where specialized agents collaborate to generate nuanced, validated, and linguistically diverse questions. The resulting benchmark is structured along relevant visual understanding dimensions that probe a model's ability to interpret symbolic meaning, narratives, and complex visual relationships. Our evaluation of 14 state-of-the-art MLLMs on this benchmark reveals significant limitations in current models, including a surprising weakness in simple counting tasks and a clear performance gap between proprietary and open-source models.
