CL CV CY
Dec 19, 2025

Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?

Zabir Al Nazi, GM Shahariar, Md. Abrar Hossain, Wei Peng

arXiv:2512.17394v2 2.7 h-index: 6

Originality: Incremental advance

AI Analysis

This work addresses the need for cross-cultural social reasoning benchmarks in AI, highlighting persistent limitations in VLM performance for robust, visually grounded understanding.

The paper evaluates Vision-Language Models (VLMs) on Theory of Mind (ToM) reasoning across diverse cultural contexts. Frontier models achieve high overall accuracy (>93%) on the new CulturalToM-VQA benchmark, but they struggle with false belief reasoning (19-83% accuracy) and exhibit social desirability bias.

Theory of Mind (ToM) - the ability to attribute beliefs and intents to others - is fundamental for social intelligence, yet Vision-Language Model (VLM) evaluations remain largely Western-centric. In this work, we introduce CulturalToM-VQA, a benchmark of 5,095 visually situated ToM probes across diverse cultural contexts, rituals, and social norms. Constructed through a human-verified pipeline built on a frontier proprietary MLLM, the dataset spans a taxonomy of six ToM tasks and four complexity levels. We benchmark 10 VLMs (2023-2025) and observe a significant performance leap: while earlier models struggle, frontier models achieve high accuracy (>93%). However, significant limitations persist: models struggle with false belief reasoning (19-83% accuracy) and show high regional variance (20-30% gaps). Crucially, we find that SOTA models exhibit social desirability bias, systematically favoring semantically positive answer choices over negative ones. Ablation experiments reveal that some frontier models rely heavily on parametric social priors, frequently defaulting to safety-aligned predictions. Furthermore, while Chain-of-Thought prompting aids older models, it yields minimal gains for newer ones. Overall, our work provides a testbed for cross-cultural social reasoning, underscoring that despite architectural gains, achieving robust, visually grounded understanding remains an open challenge.
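
The abstract reports three kinds of numbers: per-task accuracy, regional accuracy gaps, and a preference for positive answer choices. The sketch below shows one way such quantities could be computed from multiple-choice predictions. It is an illustration under stated assumptions, not the authors' evaluation code: the record fields (`region`, `task`, `predicted`, `gold`, `predicted_is_positive`), the function names, and the toy records are all hypothetical.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Group records by records[i][key] and return {group: accuracy}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r[key]] += 1
        correct[r[key]] += int(r["predicted"] == r["gold"])
    return {group: correct[group] / total[group] for group in total}


def max_gap(accuracies):
    """Largest difference in accuracy between any two groups."""
    return max(accuracies.values()) - min(accuracies.values())


def positive_rate_on_errors(records):
    """Fraction of wrong answers where the chosen option was a 'positive' one.

    A high value would be one possible signal of social desirability bias.
    Returns None when there are no errors to inspect.
    """
    errors = [r for r in records if r["predicted"] != r["gold"]]
    if not errors:
        return None
    return sum(bool(r["predicted_is_positive"]) for r in errors) / len(errors)


# Toy records (hypothetical; not taken from the benchmark).
records = [
    {"region": "A", "task": "false_belief", "predicted": "B", "gold": "A",
     "predicted_is_positive": True},
    {"region": "A", "task": "other_task", "predicted": "C", "gold": "C",
     "predicted_is_positive": True},
    {"region": "B", "task": "false_belief", "predicted": "A", "gold": "A",
     "predicted_is_positive": False},
    {"region": "B", "task": "other_task", "predicted": "D", "gold": "D",
     "predicted_is_positive": True},
]

region_acc = accuracy_by(records, "region")
task_acc = accuracy_by(records, "task")
regional_gap = max_gap(region_acc)
error_positive_rate = positive_rate_on_errors(records)
```

Grouping by a field name keeps the same helper usable for tasks, regions, or complexity levels. Labeling which answer options count as "positive" is a separate annotation step that this sketch takes as given.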
