CL · Jul 6, 2025

MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind

Emilio Villa-Cueva, S M Masrur Ahmed, Rendi Chevi, Jan Christian Blaise Cruz, Kareem Elzeky, Fermin Cristobal, Alham Fikri Aji, Skyler Wang, Rada Mihalcea, Thamar Solorio

arXiv:2507.04415v2 · 12.04 citations · h-index: 39 · EMNLP

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for better social understanding in AI, a prerequisite for building socially intelligent agents. Its contribution is incremental, since it focuses on benchmarking rather than on novel solutions.

The authors tackle the problem of assessing Theory of Mind (ToM) in multimodal AI by introducing MoMentS, a benchmark with over 2,300 questions across seven categories. They find that vision generally improves performance but models still struggle with multimodal integration, and that processing dialogue as audio does not consistently outperform transcript-based text input.

Understanding Theory of Mind (ToM) is essential for building socially intelligent multimodal agents capable of perceiving and interpreting human behavior. We introduce MoMentS (Multimodal Mental States), a comprehensive benchmark designed to assess the ToM capabilities of multimodal large language models (MLLMs) through realistic, narrative-rich scenarios presented in short films. MoMentS includes over 2,300 multiple-choice questions spanning seven distinct ToM categories. The benchmark features long video context windows and realistic social interactions that provide deeper insight into characters' mental states. We evaluate several MLLMs and find that although vision generally improves performance, models still struggle to integrate it effectively. For audio, models that process dialogues as audio do not consistently outperform transcript-based inputs. Our findings highlight the need to improve multimodal integration and point to open challenges that must be addressed to advance AI's social understanding.
