CV CL · Nov 21, 2025

Vision Language Models are Confused Tourists

Patrick Amadeus Irawan, Ikhlasul Akmal Hanif, Muhammad Dehan Al Kautsar, Genta Indra Winata, Fajri Koto, Alham Fikri Aji

arXiv:2511.17004v36.22 citations

Originality: Incremental advance

AI Analysis

This work addresses a critical challenge for multicultural societies: it shows that mixing visual cultural concepts can substantially impair state-of-the-art VLMs, highlighting the need for more culturally robust multimodal understanding.

The study addresses the instability of Vision-Language Models (VLMs) across diverse cultural inputs by introducing ConfusedTourist, a cultural adversarial robustness suite. It finds that accuracy drops sharply under simple image-stacking perturbations and degrades further under the image-generation-based variant.

Although the cultural dimension has been one of the key aspects in evaluating Vision-Language Models (VLMs), their ability to remain stable across diverse cultural inputs remains largely untested, despite being crucial for supporting diversity and multicultural societies. Existing evaluations often rely on benchmarks featuring only a single cultural concept per image, overlooking scenarios where multiple, potentially unrelated cultural cues coexist. To address this gap, we introduce ConfusedTourist, a novel cultural adversarial robustness suite designed to assess VLMs' stability against perturbed geographical cues. Our experiments reveal a critical vulnerability: accuracy drops sharply under simple image-stacking perturbations and worsens further under the image-generation-based variant. Interpretability analyses further show that these failures stem from systematic attention shifts toward distracting cues, diverting the model from its intended focus. These findings highlight a critical challenge: visual cultural concept mixing can substantially impair even state-of-the-art VLMs, underscoring the urgent need for more culturally robust multimodal understanding.
