CV | AI | RO | Dec 18, 2025

Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos

arXiv:2512.16907v2 | 4 citations | h-index: 21
Originality: Incremental advance
AI Analysis

This work addresses the challenge of predicting 3D hand trajectories in egocentric human interaction videos. It is an incremental advance: it improves on prior datasets, which decouple motion from semantic supervision, and on prior models, which only weakly link reasoning and action.

The paper tackles the problem of 3D hand trajectory prediction by introducing the EgoMAN dataset with 219K trajectories and 3M QA pairs, and the EgoMAN model, which links reasoning to motion, resulting in accurate and stage-aware trajectories with generalization across real-world scenes.

Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.

