Grounding Social Perception in Intuitive Physics

Lance Ying, Aydan Y. Huang, Aviv Netanyahu, Andrei Barbu, Boris Katz, Joshua B. Tenenbaum, Tianmin Shu

arXiv:2603.2741098.1h-index: 21

AI Analysis

Provides a computational account for how humans integrate physics and psychology to understand social interactions, addressing a key bottleneck in social perception AI.

The authors propose that social perception is grounded in intuitive physics and psychology, introducing PHASE dataset and SIMPLE model. SIMPLE achieves high accuracy and human-level agreement in inferring agents' goals and relations, outperforming feedforward and physics-agnostic baselines.

People infer rich social information from others' actions. These inferences are often constrained by the physical world: what agents can do, what obstacles permit, and how the physical actions of agents causally change an environment and other agents' mental states and behavior. We propose that such rich social perception is more than visual pattern matching, but rather a reasoning process grounded in an integration of intuitive psychology with intuitive physics. To test this hypothesis, we introduced PHASE (PHysically grounded Abstract Social Events), a large dataset of procedurally generated animations, depicting physically simulated two-agent interactions on a 2D surface. Each animation follows the style of the Heider and Simmel movie, with systematic variation in environment geometry, object dynamics, agent capacities, goals, and relationships (friendly/adversarial/neutral). We then present a computational model, SIMPLE, a physics-grounded Bayesian inverse planning model that integrates planning, probabilistic planning, and physics simulation to infer agents' goals and relations from their trajectories. Our experimental results showed that SIMPLE achieved high accuracy and agreement with human judgments across diverse scenarios, while feedforward baseline models -- including strong vision-language models -- and physics-agnostic inverse planning failed to achieve human-level performance and did not align with human judgments. These results suggest that our model provides a computational account for how people understand physically grounded social scenes by inverting a generative model of physics and agents.

View on arXiv PDF

Similar