CL, CV, CY
Dec 19, 2025

Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?

arXiv:2512.17394v2
h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses the need for cross-cultural social reasoning benchmarks in AI, highlighting persistent limitations in VLM performance for robust, visually grounded understanding.

The paper tackles the problem of evaluating Vision-Language Models (VLMs) for Theory of Mind (ToM) reasoning across diverse cultural contexts, revealing that while frontier models achieve high accuracy (>93%) on a new benchmark, they struggle with false belief reasoning (19-83% accuracy) and exhibit social desirability bias.

Theory of Mind (ToM), the ability to attribute beliefs and intents to others, is fundamental for social intelligence, yet Vision-Language Model (VLM) evaluations remain largely Western-centric. In this work, we introduce CulturalToM-VQA, a benchmark of 5,095 visually situated ToM probes across diverse cultural contexts, rituals, and social norms. Constructed through a human-verified pipeline that uses a frontier proprietary MLLM, the dataset spans a taxonomy of six ToM tasks and four complexity levels. We benchmark 10 VLMs (2023-2025) and observe a significant performance leap: while earlier models struggle, frontier models achieve high accuracy (>93%). However, significant limitations persist: models struggle with false belief reasoning (19-83% accuracy) and show high regional variance (gaps of 20-30%). Crucially, we find that SOTA models exhibit social desirability bias, systematically favoring semantically positive answer choices over negative ones. Ablation experiments reveal that some frontier models rely heavily on parametric social priors, frequently defaulting to safety-aligned predictions. Furthermore, while Chain-of-Thought prompting aids older models, it yields minimal gains for newer ones. Overall, our work provides a testbed for cross-cultural social reasoning, underscoring that despite architectural gains, achieving robust, visually grounded understanding remains an open challenge.
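
The per-task accuracies and regional gaps quoted above are grouped statistics over individual probe results. The following is a minimal sketch of how such numbers could be computed; it is not the authors' evaluation code. The record fields (`region`, `task`, `gold`, `predicted`), the helper names, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy}, grouping records by the value of records[i][key]."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r[key]] += 1
        correct[r[key]] += int(r["predicted"] == r["gold"])
    return {group: correct[group] / total[group] for group in total}


def max_gap(accuracies):
    """Largest difference between any two group accuracies."""
    return max(accuracies.values()) - min(accuracies.values())


# Made-up toy records (not data from the paper): one entry per answered probe.
toy_records = [
    {"region": "A", "task": "false_belief", "gold": "b", "predicted": "b"},
    {"region": "A", "task": "emotion", "gold": "a", "predicted": "a"},
    {"region": "B", "task": "false_belief", "gold": "c", "predicted": "a"},
    {"region": "B", "task": "emotion", "gold": "d", "predicted": "d"},
]

region_acc = accuracy_by_group(toy_records, "region")
task_acc = accuracy_by_group(toy_records, "task")
print(region_acc, max_gap(region_acc))
print(task_acc)
```

Grouping by `task` instead of `region` gives the per-task breakdown in the same way, which is how a weak category such as false belief reasoning would stand out against a high overall accuracy.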

