CV | Jun 3, 2025

SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios

Lingwei Dang, Ruizhi Shao, Hongwen Zhang, Wei Min, Yebin Liu, Qingyao Wu

arXiv:2506.02444v3 | 14.46 citations | h-index: 9 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of generating physically plausible and visually consistent hand-object interaction (HOI) sequences for applications in robotics and virtual reality. It appears incremental, however, because it combines existing diffusion techniques with new constraints.

The paper proposes a synchronized diffusion framework that generates hand-object interaction (HOI) videos and 3D motions simultaneously. The framework removes the reliance on predefined object models and improves video-motion consistency. The reported experiments show higher fidelity and better generalization than state-of-the-art methods.

Hand-Object Interaction (HOI) generation has significant application potential. However, current 3D HOI motion generation approaches heavily rely on predefined 3D object models and lab-captured motion data, limiting generalization capabilities. Meanwhile, HOI video generation methods prioritize pixel-level visual fidelity, often sacrificing physical plausibility. Recognizing that visual appearance and motion patterns share fundamental physical laws in the real world, we propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate the HOI video and motion simultaneously. To integrate the heterogeneous semantics, appearance, and motion features, our method implements tri-modal adaptive modulation for feature aligning, coupled with 3D full-attention for modeling inter- and intra-modal dependencies. Furthermore, we introduce a vision-aware 3D interaction diffusion model that generates explicit 3D interaction sequences directly from the synchronized diffusion outputs, then feeds them back to establish a closed-loop feedback cycle. This architecture eliminates dependencies on predefined object models or explicit pose guidance while significantly enhancing video-motion consistency. Experimental results demonstrate our method's superiority over state-of-the-art approaches in generating high-fidelity, dynamically plausible HOI sequences, with notable generalization capabilities in unseen real-world scenarios. Project page at https://github.com/Droliven/SViMo_project.
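The feature-integration step described in the abstract (per-modality adaptive modulation followed by attention across all modalities) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the identity attention projections, the flattening of all tokens into one sequence, and the hand-supplied `params` (which a real model would presumably predict from a conditioning signal such as the diffusion timestep) are all assumptions made for illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def modulate(tokens, scale, shift):
    """Adaptive modulation: normalize each token, then apply a scale and shift."""
    return [[n * (1.0 + s) + b for n, s, b in zip(layer_norm(t), scale, shift)]
            for t in tokens]

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def full_attention(tokens):
    """Single-head self-attention with identity projections (illustration only)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out

def joint_block(text, video, motion, params):
    """Modulate each modality with its own parameters, then attend over all tokens."""
    streams = {"text": text, "video": video, "motion": motion}
    joint = []
    for name, toks in streams.items():
        scale, shift = params[name]  # hypothetical per-modality parameters
        joint.extend(modulate(toks, scale, shift))
    mixed = full_attention(joint)  # every token attends to every modality
    # Split the mixed sequence back into the three modalities.
    n_t, n_v = len(text), len(video)
    return mixed[:n_t], mixed[n_t:n_t + n_v], mixed[n_t + n_v:]

# Toy inputs: 1 text token, 2 video tokens, 1 motion token, feature size 4.
text = [[1.0, 0.0, 0.0, 0.0]]
video = [[0.0, 1.0, 0.0, 2.0], [0.5, 0.5, 1.0, 0.0]]
motion = [[0.0, 0.0, 1.0, 3.0]]
params = {name: ([0.1] * 4, [0.0] * 4) for name in ("text", "video", "motion")}
text_out, video_out, motion_out = joint_block(text, video, motion, params)
```

Giving each modality its own scale and shift lets heterogeneous features be brought to a comparable range before a single attention pass mixes them, which is the general motivation for this kind of design; how SViMo parameterizes it is specified in the paper and code, not here.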

