Learning Predictive Visuomotor Coordination
This work addresses a key challenge in robotics and human-computer interaction, namely predictive modeling of human behavior, though the contribution appears incremental because it extends existing diffusion-based methods.
It tackles the problem of predicting human visuomotor coordination signals, such as head pose and gaze, from egocentric visual and kinematic data, and reports strong generalization on the EgoExo4D dataset.
Understanding and predicting human visuomotor coordination is crucial for applications in robotics, human-computer interaction, and assistive technologies. This work introduces a forecasting-based task for visuomotor modeling, where the goal is to predict head pose, gaze, and upper-body motion from egocentric visual and kinematic observations. We propose a \textit{Visuomotor Coordination Representation} (VCR) that learns structured temporal dependencies across these multimodal signals. We extend a diffusion-based motion modeling framework that integrates egocentric vision and kinematic sequences, enabling temporally coherent and accurate visuomotor predictions. We evaluate our approach on the large-scale EgoExo4D dataset, where it demonstrates strong generalization across diverse real-world activities. Our results highlight the importance of multimodal integration in understanding visuomotor coordination, contributing to research in visuomotor learning and human behavior modeling.
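As a rough illustration of the forecasting setup described above, the following sketch splits a per-frame kinematic sequence into an observed window and a future target, then applies the standard forward-diffusion noising step to the future frames. It is not the authors' implementation: the window lengths, the per-frame feature layout, the cosine noise schedule, and the names `split_window`, `cosine_alpha_bar`, and `add_noise` are all assumptions made for illustration, and the visual stream and the denoising network are omitted entirely.

```python
import math
import random


def split_window(sequence, n_obs, n_future):
    """Split a list of per-frame feature vectors into (observed past, future target).

    The window lengths are hypothetical; the abstract does not state them.
    """
    if len(sequence) < n_obs + n_future:
        raise ValueError("sequence is shorter than n_obs + n_future")
    return sequence[:n_obs], sequence[n_obs:n_obs + n_future]


def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal level at step t under a cosine schedule (assumed choice)."""
    def f(u):
        return math.cos(((u / num_steps) + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def add_noise(future, t, num_steps, rng):
    """Forward diffusion: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps, with eps ~ N(0, 1)."""
    a = cosine_alpha_bar(t, num_steps)
    signal, noise = math.sqrt(a), math.sqrt(1.0 - a)
    return [[signal * v + noise * rng.gauss(0.0, 1.0) for v in frame]
            for frame in future]


# Toy sequence: 12 frames, each a hypothetical 5-D vector
# (e.g. 3 head-pose values followed by 2 gaze values).
rng = random.Random(0)
sequence = [[0.1 * i + 0.01 * j for j in range(5)] for i in range(12)]

observed, future = split_window(sequence, n_obs=8, n_future=4)
noisy_future = add_noise(future, t=50, num_steps=100, rng=rng)
```

In a diffusion-based forecaster of this kind, a network would be trained to recover `future` from `noisy_future` while conditioning on `observed` and on features from the egocentric video; that conditioning is where a learned representation such as the VCR would enter.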