CL | Apr 15, 2025

Benchmarking Vision Language Models on German Factual Data

arXiv:2504.11108v2 | 4 citations | h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work addresses the underrepresentation of German in VLMs and highlights performance gaps for non-English languages. It is incremental, however: it benchmarks existing models and proposes no new methods.

The study evaluated open-weight vision language models on factual knowledge in German and English. For German-specific images such as celebrities and sights, the models struggle with visual cognition. For animals and plants, they often fail in German. For cars and supermarket products, they perform equally well in both languages.

Similar to LLMs, the development of vision language models (VLMs) is mainly driven by English datasets and models trained on English and Chinese, whereas support for other languages, even high-resource languages such as German, remains significantly weaker. In this work, we present an analysis of open-weight VLMs on factual knowledge in German and English. We disentangle the image-related aspects from the textual ones by analyzing accuracy with jury-as-a-judge in both prompt languages and on images from German and international contexts. We found that for celebrities and sights, VLMs struggle because they lack visual cognition of German image contents. For animals and plants, the tested models can often correctly identify the image contents according to the scientific name or English common name but fail in German. Cars and supermarket products were identified equally well in English and German images across both prompt languages.
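The disentangling described in the abstract can be pictured as a grid of accuracies, one cell per combination of prompt language and image origin. The following is a minimal sketch of that bookkeeping, not the authors' implementation. It assumes the jury is a set of judge models whose boolean verdicts are combined by simple majority, which the abstract does not specify. The function names, record fields, and toy data are all hypothetical.

```python
from collections import defaultdict


def jury_verdict(votes):
    """Combine boolean judge verdicts by simple majority (assumed rule)."""
    return sum(votes) > len(votes) / 2


def accuracy_by_condition(records):
    """Return accuracy per (prompt_lang, image_origin) cell."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        key = (rec["prompt_lang"], rec["image_origin"])
        total[key] += 1
        # A True verdict counts as 1, a False verdict as 0.
        correct[key] += jury_verdict(rec["votes"])
    return {key: correct[key] / total[key] for key in total}


# Hypothetical toy records: each holds one model answer judged by three jurors.
records = [
    {"prompt_lang": "de", "image_origin": "german", "votes": [True, False, False]},
    {"prompt_lang": "de", "image_origin": "german", "votes": [True, True, False]},
    {"prompt_lang": "en", "image_origin": "german", "votes": [True, True, True]},
    {"prompt_lang": "en", "image_origin": "international", "votes": [True, True, False]},
    {"prompt_lang": "de", "image_origin": "international", "votes": [False, False, True]},
]

result = accuracy_by_condition(records)
for condition, accuracy in sorted(result.items()):
    print(condition, accuracy)
```

Under these assumptions, comparing cells that share an image origin but differ in prompt language isolates the textual effect, while comparing cells that share a prompt language but differ in image origin isolates the visual one.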

