CV | AI | Jun 19, 2024

Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models

arXiv:2406.13763v2 | 7 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of spatio-temporal social reasoning for AI systems, but it is incremental: it extends existing text-based theory-of-mind methods to video data.

The paper tackles the problem of enabling large multimodal models to perform human-like theory-of-mind reasoning in dynamic video scenes. The result is a pipeline that retrieves key frames to answer explicit probing questions about social and emotional content.

Can large multimodal models have a human-like ability for emotional and social reasoning, and if so, how does it work? Recent research has discovered emergent theory-of-mind (ToM) reasoning capabilities in large language models (LLMs). LLMs can reason about people's mental states by solving various text-based ToM tasks that ask questions about the actors' ToM (e.g., human belief, desire, intention). However, human reasoning in the wild is often grounded in dynamic scenes across time. Thus, we consider videos a new medium for examining spatio-temporal ToM reasoning ability. Specifically, we ask explicit probing questions about videos with abundant social and emotional reasoning content. We develop a pipeline that lets a multimodal LLM perform ToM reasoning over video and text. We also enable explicit ToM reasoning by retrieving key frames for answering a ToM question, which reveals how multimodal LLMs reason about ToM.
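
The abstract does not say how key frames are retrieved, so the following is only an illustrative sketch of one common approach, not the authors' method. It assumes each video frame and the ToM question have already been embedded into a shared vector space by some multimodal encoder (the encoder is not shown, and the toy vectors are made up). The names `cosine` and `retrieve_key_frames` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_key_frames(
    question_vec: Sequence[float],
    frame_vecs: Sequence[Sequence[float]],
    k: int = 2,
) -> List[int]:
    """Return the indices of the k frames most similar to the question.

    Indices are returned in temporal (ascending) order so the selected
    frames can be passed to a model as a short, ordered clip.
    """
    ranked = sorted(
        range(len(frame_vecs)),
        key=lambda i: cosine(question_vec, frame_vecs[i]),
        reverse=True,
    )
    return sorted(ranked[:k])


# Toy embeddings (assumed, for illustration): four frames in a 3-d space.
frames = [
    [1.0, 0.0, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 1.0, 0.1],
    [0.0, 0.0, 1.0],
]
question = [0.0, 1.0, 0.0]

key_frames = retrieve_key_frames(question, frames, k=2)
print(key_frames)
```

In a setup like this, the retrieved indices double as a rough explanation of the answer, since they show which moments in the video the question was matched against.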

