AI CL CV
Jun 13, 2025

VLM@school -- Evaluation of AI image understanding on German middle school knowledge

arXiv:2506.11604v2
3 citations, h-index: 1

Originality: Synthesis-oriented

AI Analysis

The paper addresses the need for more realistic, non-English benchmarks to stress-test VLMs, though the contribution is incremental because it builds on existing evaluation methods.

The paper evaluates Vision Language Models (VLMs) on German-language tasks that require both visual reasoning and subject-specific knowledge, using a benchmark dataset based on middle school curricula. It finds that even the top models achieve less than 45% overall accuracy, with particularly poor performance in areas such as music and mathematics.

This paper introduces a novel benchmark dataset designed to evaluate the capabilities of Vision Language Models (VLMs) on tasks that combine visual reasoning with subject-specific background knowledge in the German language. In contrast to widely used English-language benchmarks that often rely on artificially difficult or decontextualized problems, this dataset draws from real middle school curricula across nine domains including mathematics, history, biology, and religion. The benchmark includes over 2,000 open-ended questions grounded in 486 images, ensuring that models must integrate visual interpretation with factual reasoning rather than rely on superficial textual cues. We evaluate thirteen state-of-the-art open-weight VLMs across multiple dimensions, including domain-specific accuracy and performance on adversarially crafted questions. Our findings reveal that even the strongest models achieve less than 45% overall accuracy, with particularly poor performance in music, mathematics, and adversarial settings. Furthermore, the results indicate significant discrepancies between success on popular benchmarks and real-world multimodal understanding. We conclude that middle school-level tasks offer a meaningful and underutilized avenue for stress-testing VLMs, especially in non-English contexts. The dataset and evaluation protocol serve as a rigorous testbed to better understand and improve the visual and linguistic reasoning capabilities of future AI systems.
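The abstract reports overall accuracy, domain-specific accuracy, and accuracy on adversarial questions. As a rough illustration of how such a breakdown could be computed, here is a minimal sketch. It assumes every open-ended answer has already been graded as correct or incorrect; the abstract does not say how grading is done. The record fields (`domain`, `adversarial`, `correct`), the function names, and the sample rows are all hypothetical and are not taken from the paper or its dataset.

```python
from collections import defaultdict


def overall_accuracy(records):
    """Fraction of graded answers marked correct (0.0 for an empty list)."""
    if not records:
        return 0.0
    return sum(int(r["correct"]) for r in records) / len(records)


def accuracy_by_group(records, key):
    """Return {group value: fraction correct} for the given record field."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        correct[r[key]] += int(r["correct"])
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical graded answers, made up purely for illustration.
records = [
    {"domain": "music", "adversarial": False, "correct": False},
    {"domain": "music", "adversarial": True, "correct": False},
    {"domain": "history", "adversarial": False, "correct": True},
    {"domain": "history", "adversarial": False, "correct": False},
    {"domain": "biology", "adversarial": True, "correct": True},
]

overall = overall_accuracy(records)
per_domain = accuracy_by_group(records, "domain")
per_setting = accuracy_by_group(records, "adversarial")

print(f"overall: {overall:.2f}")
for domain, acc in sorted(per_domain.items()):
    print(f"{domain}: {acc:.2f}")
```

Grouping by a field name keeps the same helper usable for both the per-domain and the adversarial versus non-adversarial breakdowns.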
