CV · Jun 4

Resonant Minds: Closed-Loop Social Avatars with Theory of Mind

Jianxu Shangguan, Jing Xu, Hang Ye, Xiaoxuan Ma, Yizhou Wang, Wentao Zhu

arXiv:2606.05896 · 73.9

Predicted impact: top 37% in CV · last 90 days
Originality: Incremental advance

AI Analysis

For researchers in social AI and digital humans, this work unifies cognitive reasoning and multimodal generation in a closed-loop system, demonstrating that explicit mental state inference can improve dialogue quality.

The paper proposes a closed-loop dual-agent framework that integrates perception, social reasoning (via Theory of Mind), and multimodal expression to create lifelike digital humans with social intelligence. Experiments show competitive or superior performance on dialogue quality and video generation metrics, surpassing a full-information baseline on key dialogue dimensions.

Creating lifelike digital humans with genuine social intelligence requires unifying cognitive reasoning and multimodal generation within a coherent framework. Current approaches treat these as separate tasks: Large Language Models excel at dialogue but lack embodied expression, while diffusion-based talking head models achieve visual fidelity but ignore social cognition. To bridge this gap, we propose a closed-loop dual-agent framework integrating perception, social reasoning, and expression into a continuous interaction cycle. The perception module analyzes partners' multimodal behaviors from video, while the social reasoning module infers hidden mental states through Theory of Mind and selects responses via an ensemble mechanism. The expression module then generates emotion-controllable dual-agent videos synthesizing both speaker speech and expression alongside listener reactive behaviors, capturing bidirectional dynamics absent in prior work. We construct a hierarchical Persona-Scenario dataset with psychologically grounded personas and private social goals to support evaluation under information asymmetry. Experiments on this dataset demonstrate competitive or superior performance on both dialogue quality and video generation metrics. Notably, our method surpasses even the full-information Script mode on key dialogue quality dimensions, suggesting that explicit mental state inference under uncertainty can elicit more thoughtful dialogue than unrestricted information access.
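The interaction cycle described in the abstract (perceive the partner, infer a hidden mental state, select a response through an ensemble, express it, then swap roles) can be illustrated with a toy sketch. This is not the authors' implementation: every class, function, scorer, and emotion label below is a hypothetical stand-in. String checks replace the video perception, the LLM-based reasoning, and the video generation that the paper describes, and averaging two hand-written scorers is only one possible reading of "ensemble mechanism".

```python
from dataclasses import dataclass


@dataclass
class Observation:
    text: str     # what the partner said
    emotion: str  # coarse emotion label read from the partner's behavior


@dataclass
class Agent:
    name: str
    private_goal: str                      # hidden from the partner
    belief_about_partner: str = "unknown"  # Theory-of-Mind estimate

    def perceive(self, message):
        # Stand-in for perception: a real system would analyze video.
        return Observation(text=message["speech"], emotion=message["emotion"])

    def infer_mental_state(self, obs):
        # Stand-in for Theory of Mind: guess the partner's state from emotion.
        self.belief_about_partner = "receptive" if obs.emotion == "warm" else "guarded"
        return self.belief_about_partner

    def select_response(self, candidates, scorers):
        # Assumed ensemble rule: average several scorers, keep the best candidate.
        def mean_score(candidate):
            return sum(s(candidate, self) for s in scorers) / len(scorers)
        return max(candidates, key=mean_score)

    def express(self, speech):
        # Stand-in for expression: attach an emotion label instead of video.
        emotion = "warm" if "understand" in speech else "neutral"
        return {"speaker": self.name, "speech": speech, "emotion": emotion}


def propose(agent):
    # Hypothetical candidate generator; a real system would query an LLM.
    return [
        f"I want to {agent.private_goal}.",
        f"I understand how you feel. Could we {agent.private_goal}?",
        "Nice weather today.",
    ]


def goal_score(candidate, agent):
    # Rewards candidates that advance the agent's private goal.
    return 1.0 if agent.private_goal in candidate else 0.0


def empathy_score(candidate, agent):
    # Rewards empathy only when the partner is believed to be guarded.
    wants_empathy = agent.belief_about_partner == "guarded"
    return 1.0 if wants_empathy == ("understand" in candidate) else 0.0


def run_dialogue(first, second, opening, turns=4):
    # The opening line is attributed to `second`; `first` responds to it.
    message = {"speaker": second.name, "speech": opening, "emotion": "neutral"}
    log = []
    speaker, listener = first, second
    for _ in range(turns):
        obs = speaker.perceive(message)
        speaker.infer_mental_state(obs)
        reply = speaker.select_response(propose(speaker), [goal_score, empathy_score])
        message = speaker.express(reply)
        log.append(message)
        speaker, listener = listener, speaker  # close the loop: roles swap
    return log


if __name__ == "__main__":
    alice = Agent("alice", "split the rent evenly")
    bob = Agent("bob", "keep the larger room")
    for turn in run_dialogue(alice, bob, "Hi."):
        print(turn)
```

Each agent sees only the partner's observable output, never the partner's `private_goal`, which mirrors the information asymmetry the dataset is built to evaluate. In this toy setting an empathetic reply makes the partner appear receptive, which in turn changes the partner's next choice, so the output of one turn feeds the inference of the next.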
