CV · Mar 10

DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary

arXiv:2603.09883v1 · 78.2 · h-index: 11
Predicted impact: top 31% in CV · last 90 days · Originality: Incremental advance
AI Analysis

This paper addresses limited control and consistency in human-centric video generation for applications that require intuitive user interaction, though it is an incremental improvement over existing methods.

The paper tackles the problem of generating controllable and physically consistent Human-Object Interaction (HOI) videos by introducing DISPLAY, which uses sparse motion guidance (wrist joint coordinates and object bounding boxes) to improve flexibility and generalization, achieving high-fidelity results across diverse tasks.

Human-centric video generation has advanced rapidly, yet existing methods struggle to produce controllable and physically consistent Human-Object Interaction (HOI) videos. Existing works rely on dense control signals, template videos, or carefully crafted text prompts, which limit flexibility and generalization to novel objects. We introduce a framework, namely DISPLAY, guided by Sparse Motion Guidance, composed only of wrist joint coordinates and a shape-agnostic object bounding box. This lightweight guidance alleviates the imbalance between human and object representations and enables intuitive user control. To enhance fidelity under such sparse conditions, we propose an Object-Stressed Attention mechanism that improves object robustness. To address the scarcity of high-quality HOI data, we further develop a Multi-Task Auxiliary Training strategy with a dedicated data curation pipeline, allowing the model to benefit from both reliable HOI samples and auxiliary tasks. Comprehensive experiments show that our method achieves high-fidelity, controllable HOI generation across diverse tasks. The project page can be found at https://mumuwei.github.io/DISPLAY/.
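To make the idea of sparse guidance concrete, here is a minimal Python sketch of what a per-frame control signal made only of wrist coordinates and an object bounding box might look like. It is an illustration, not the authors' implementation: the names `FrameGuidance`, `interpolate_guidance`, and `scalars_per_clip` are hypothetical, and the use of two wrists, normalized coordinates, and linear interpolation between user-specified keyframes are all assumptions that the abstract does not confirm.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: coordinates are normalized to [0, 1] relative to the frame size.
Point = Tuple[float, float]              # (x, y)
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass(frozen=True)
class FrameGuidance:
    """Hypothetical per-frame sparse guidance: two wrists plus one object box."""
    left_wrist: Point
    right_wrist: Point
    object_box: Box


def _lerp_tuple(a, b, t):
    # Element-wise linear interpolation between two equal-length tuples.
    return tuple(x + (y - x) * t for x, y in zip(a, b))


def interpolate_guidance(start: FrameGuidance, end: FrameGuidance,
                         num_frames: int) -> List[FrameGuidance]:
    """Fill in a guidance track between two keyframes (assumed workflow)."""
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    frames = []
    for i in range(num_frames):
        t = i / (num_frames - 1)
        frames.append(FrameGuidance(
            left_wrist=_lerp_tuple(start.left_wrist, end.left_wrist, t),
            right_wrist=_lerp_tuple(start.right_wrist, end.right_wrist, t),
            object_box=_lerp_tuple(start.object_box, end.object_box, t),
        ))
    return frames


def scalars_per_clip(frames: List[FrameGuidance]) -> int:
    # 2 wrist points x 2 coordinates + 4 box coordinates = 8 numbers per frame.
    return len(frames) * (2 + 2 + 4)


# The left wrist and the object move right together; the right wrist stays put.
start = FrameGuidance((0.25, 0.5), (0.75, 0.5), (0.0, 0.5, 0.25, 0.75))
end = FrameGuidance((0.5, 0.5), (0.75, 0.5), (0.25, 0.5, 0.5, 0.75))
track = interpolate_guidance(start, end, num_frames=5)
print(scalars_per_clip(track))  # 40
```

Under these assumptions a five-frame clip is described by 40 numbers, which gives a sense of how little such a signal carries compared with dense controls such as per-pixel maps or template videos.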

