CV · AI · Mar 16

MVHOI: Bridge Multi-view Condition to Complex Human-Object Interaction Video Reenactment via 3D Foundation Model

arXiv:2603.14686 · 99.4 · 1 citations · h-index: 12
AI Analysis

This work addresses a frontier in expressive digital human creation, with applications such as animation and virtual reality. It presents a novel method for a known bottleneck rather than an incremental improvement.

The paper tackles the generation of realistic Human-Object Interaction (HOI) videos with complex non-planar motions, such as out-of-plane reorientation. It proposes MVHOI, a two-stage framework that combines a 3D Foundation Model with a controllable video generation model, and reports superior performance in long-duration HOI video synthesis.

Human-Object Interaction (HOI) video reenactment with realistic motion remains a frontier in expressive digital human creation. Existing approaches primarily handle simple image-plane motion (e.g., in-plane translations), struggling with complex non-planar manipulations like out-of-plane reorientation. In this paper, we propose MVHOI, a two-stage HOI video reenactment framework that bridges multi-view reference conditions and video foundation models via a 3D Foundation Model (3DFM). The 3DFM first produces view-consistent object priors conditioned on implicit motion dynamics across novel viewpoints. A controllable video generation model then synthesizes high-fidelity object texture by incorporating multi-view reference images, ensuring appearance consistency via a reasonable retrieval mechanism. By enabling these two stages to mutually reinforce one another during the inference phase, our framework shows superior performance in generating long-duration HOI videos with intricate object manipulations. Extensive experiments show substantial improvements over prior approaches, especially for HOI with complex 3D object manipulations.
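
The abstract does not specify how the retrieval mechanism chooses among the multi-view reference images. As a purely illustrative sketch, and not the authors' method, one plausible reading is a nearest-pose lookup: for each frame, pick the reference view whose object orientation is closest to the orientation implied by the 3D prior. Everything below is an assumption made for illustration, including the function names (`rotation_about_y`, `geodesic_angle`, `retrieve_reference`), the use of rotation matrices as the pose representation, and the geodesic-angle distance.

```python
import math


def rotation_about_y(deg):
    """3x3 rotation matrix (nested lists) for a rotation of `deg` degrees about Y."""
    t = math.radians(deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def geodesic_angle(r_a, r_b):
    """Angle in degrees between two rotations: arccos((trace(Ra^T Rb) - 1) / 2)."""
    # trace(Ra^T Rb) equals the element-wise (Frobenius) inner product of Ra and Rb.
    trace = sum(r_a[k][i] * r_b[k][i] for k in range(3) for i in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding error
    return math.degrees(math.acos(cos_angle))


def retrieve_reference(target_rotation, reference_rotations):
    """Return (index, angle) of the reference view closest to the target pose."""
    angles = [geodesic_angle(target_rotation, r) for r in reference_rotations]
    best = min(range(len(angles)), key=angles.__getitem__)
    return best, angles[best]


# Hypothetical setup: four reference photos taken 90 degrees apart around the object.
references = [rotation_about_y(d) for d in (0, 90, 180, 270)]

# Hypothetical per-frame object orientations, standing in for the 3D prior's output.
per_frame_targets = [rotation_about_y(d) for d in (10, 80, 200)]

selected = [retrieve_reference(t, references)[0] for t in per_frame_targets]
print(selected)
```

In this toy setup each frame is matched to the reference photo taken from the most similar angle. A real system would likely also have to account for occlusion by the hand and for references that cover the object only partially, which this sketch ignores.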
