CV · Dec 10, 2025

VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification

arXiv:2512.09646v1 · 3 citations · h-index: 33
Originality: Incremental advance
AI Analysis

This work addresses the challenge of generating controllable videos for applications such as animation and simulation. It is an incremental advance because it builds on existing video diffusion models.

The paper tackles the problem of generating realistic, controllable videos of human-object interactions from sparse trajectories. It proposes VHOI, a two-stage framework that first densifies the trajectories into HOI mask sequences and then fine-tunes a video diffusion model conditioned on those masks, and it reports state-of-the-art results in controllable HOI video generation.

Synthesizing realistic human-object interactions (HOI) in video is challenging due to the complex, instance-specific interaction dynamics of both humans and objects. Incorporating controllability in video generation further adds to the complexity. Existing controllable video generation approaches face a trade-off: sparse controls like keypoint trajectories are easy to specify but lack instance-awareness, while dense signals such as optical flow, depths or 3D meshes are informative but costly to obtain. We propose VHOI, a two-stage framework that first densifies sparse trajectories into HOI mask sequences, and then fine-tunes a video diffusion model conditioned on these dense masks. We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion, but also body-part-specific dynamics. This design incorporates a human prior into the conditioning signal and strengthens the model's ability to understand and generate realistic HOI dynamics. Experiments demonstrate state-of-the-art results in controllable HOI video generation. VHOI is not limited to interaction-only scenarios and can also generate full human navigation leading up to object interactions in an end-to-end manner. Project page: https://vcai.mpi-inf.mpg.de/projects/vhoi/.
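The color-encoded motion representation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the label ids, the palette, and the helper names `encode_hoi_masks` and `toy_labels` are assumptions chosen for illustration, and the toy label maps stand in for the output of the densification stage, which is not reproduced here.

```python
import numpy as np

# Hypothetical palette: one RGB color per semantic label.
# The actual colors and body-part split used by VHOI are not given in the abstract.
PALETTE = np.array(
    [
        [0, 0, 0],      # 0: background
        [255, 0, 0],    # 1: object
        [0, 255, 0],    # 2: torso
        [0, 0, 255],    # 3: left arm
        [255, 255, 0],  # 4: right arm
    ],
    dtype=np.uint8,
)


def encode_hoi_masks(labels: np.ndarray) -> np.ndarray:
    """Map integer label maps of shape (T, H, W) to RGB frames (T, H, W, 3)."""
    if labels.min() < 0 or labels.max() >= len(PALETTE):
        raise ValueError("label id outside the palette range")
    # Integer array indexing looks up one color per pixel.
    return PALETTE[labels]


def toy_labels(num_frames: int = 4, height: int = 8, width: int = 8) -> np.ndarray:
    """Build a toy sequence: a static torso and an object moving right."""
    labels = np.zeros((num_frames, height, width), dtype=np.int64)
    for t in range(num_frames):
        labels[t, 2:6, 0:2] = 2          # torso stays in place
        labels[t, 3:5, 2 + t:4 + t] = 1  # object shifts one pixel per frame
    return labels


masks = encode_hoi_masks(toy_labels())
print(masks.shape)
```

In a setup like this, the resulting RGB mask sequence would serve as the dense conditioning signal for the video model, with distinct colors letting the model tell object motion apart from the motion of individual body parts.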

