CV, AI, LG, MM, RO | Jun 26, 2025

Whole-Body Conditioned Egocentric Video Prediction

arXiv:2506.21552v1 | 21 citations | h-index: 12
Originality: Incremental advance
AI Analysis

This work is an initial attempt to model complex real-world environments and embodied agent behavior through video prediction from a human's first-person perspective.

The paper predicts egocentric video from human actions, where each action is a relative 3D body pose and the conditioning signal is a kinematic pose trajectory structured by the body's joint hierarchy. The model is an auto-regressive conditional diffusion transformer trained on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. It learns to simulate how physical actions shape the environment from a first-person view, and is evaluated with a hierarchical protocol of increasingly challenging tasks that probe its embodied prediction and control abilities.

We train models to Predict Ego-centric Video from human Actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model's embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.
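
To make the conditioning signal concrete, here is a minimal sketch of how a "relative 3D body pose" action could be built from a joint hierarchy and fed to an auto-regressive predictor. It is an illustration under stated assumptions, not the paper's implementation: the five-joint skeleton `PARENTS`, the helpers `parent_relative` and `action_vector`, and the `rollout` loop with its `predict_next` callback are all hypothetical names, and the diffusion transformer itself is replaced by an arbitrary function supplied by the caller.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Pose = Dict[str, Vec3]

# Hypothetical 5-joint skeleton (joint -> parent); None marks the root.
# A real motion-capture skeleton has many more joints.
PARENTS: Dict[str, Optional[str]] = {
    "pelvis": None,
    "spine": "pelvis",
    "head": "spine",
    "l_hand": "spine",
    "r_hand": "spine",
}


def _sub(a: Vec3, b: Vec3) -> Vec3:
    """Component-wise difference a - b."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def parent_relative(pose: Pose) -> Pose:
    """Express each joint as an offset from its parent joint.

    The root keeps its global position, so whole-body translation shows up
    only in the root entry and limb motion only in the limb entries.
    """
    return {
        joint: pose[joint] if parent is None else _sub(pose[joint], pose[parent])
        for joint, parent in PARENTS.items()
    }


def action_vector(prev_pose: Pose, next_pose: Pose) -> List[float]:
    """Flatten the frame-to-frame change in parent-relative pose."""
    prev_rel = parent_relative(prev_pose)
    next_rel = parent_relative(next_pose)
    action: List[float] = []
    for joint in PARENTS:  # dict order fixes the layout of the vector
        action.extend(_sub(next_rel[joint], prev_rel[joint]))
    return action


def rollout(
    context: Sequence[object],
    actions: Sequence[Sequence[float]],
    predict_next: Callable[[Sequence[object], Sequence[float]], object],
    context_len: int = 2,
) -> List[object]:
    """Auto-regressive rollout: each predicted frame is appended to the
    history and becomes part of the context for the next prediction.

    `predict_next` stands in for the learned model (assumption).
    """
    frames = list(context)
    for action in actions:
        frames.append(predict_next(frames[-context_len:], action))
    return frames[len(context):]
```

In this sketch, a step in which the whole body translates changes only the root entries of the action vector, while an arm movement changes only that hand's entries. That separation is the motivation for structuring the action by the joint hierarchy.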

