CV | AI | Mar 28, 2025

How Well Can Vision-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark

arXiv:2503.22093v2 | 1 citation | h-index: 2 | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the need to assess how well VLMs reason about human mental states, which matters for applications in human-AI interaction. The advance is incremental, since it builds on existing VQA and ToM evaluation frameworks.

The paper evaluates how well Vision-Language Models understand human intentions in Theory of Mind tasks. It finds that GPT-4 outperformed the other models, with only GPT-4o-mini achieving comparable performance, and that models often struggle in complex scenarios such as bullying or cheating.

Vision Language Models (VLMs) have demonstrated strong reasoning capabilities in Visual Question Answering (VQA) tasks; however, their ability to perform Theory of Mind (ToM) tasks, such as inferring human intentions, beliefs, and mental states, remains underexplored. We propose an open-ended question framework to evaluate VLMs' performance across diverse categories of ToM tasks. We curated and annotated a benchmark dataset of 30 images and evaluated the performance of four VLMs of varying sizes. Our results show that the GPT-4 model outperformed all the others, with only one smaller model, GPT-4o-mini, achieving comparable performance. We observed that VLMs often struggle to infer intentions in complex scenarios such as bullying or cheating. Our findings reveal that smaller models can sometimes infer correct intentions despite relying on incorrect visual cues. The dataset is available at https://github.com/ximingwen/ToM-AAAI25-Multimodal.
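The open-ended evaluation described in the abstract can be pictured as a loop over annotated images that collects one free-text answer per question and averages scores per ToM category. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `ToMItem` fields, the `token_overlap` scorer, and the `evaluate` function are hypothetical names introduced for illustration, and the abstract does not say how answers were actually scored.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ToMItem:
    # Hypothetical schema; the released dataset may be organized differently.
    image_path: str
    category: str    # e.g. "bullying" or "cheating"
    question: str    # open-ended question about the person's intention
    reference: str   # human-annotated intention


def token_overlap(answer: str, reference: str) -> float:
    """Placeholder scorer (an assumption): share of reference tokens in the answer."""
    ref = set(reference.lower().split())
    ans = set(answer.lower().split())
    return len(ref & ans) / len(ref) if ref else 0.0


def evaluate(
    items: List[ToMItem],
    model: Callable[[str, str], str],
    scorer: Callable[[str, str], float] = token_overlap,
) -> Dict[str, float]:
    """Return the mean score per category for one model.

    `model` is any callable taking (image_path, question) and returning text,
    so a real VLM client can be plugged in without changing this loop.
    """
    per_category = defaultdict(list)
    for item in items:
        answer = model(item.image_path, item.question)
        per_category[item.category].append(scorer(answer, item.reference))
    return {cat: sum(scores) / len(scores) for cat, scores in per_category.items()}
```

Token overlap is only a stand-in here. Open-ended answers usually need human raters or a stronger judging procedure, which can be supplied through the `scorer` argument.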

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
