Unleashing Guidance Without Classifiers for Human-Object Interaction Animation
This work addresses realistic human-object interaction animation for computer graphics and robotics applications and represents an incremental improvement over prior diffusion-based approaches.
The paper proposes LIGHT, a data-driven diffusion method for generating realistic human-object interaction animations. Its pace-induced guidance reduces reliance on hand-crafted contact priors while achieving higher contact fidelity and stronger generalization to unseen objects and tasks.
Generating realistic human-object interaction (HOI) animations remains challenging because it requires jointly modeling dynamic human actions and diverse object geometries. Prior diffusion-based approaches often rely on hand-crafted contact priors or human-imposed kinematic constraints to improve contact quality. We propose LIGHT, a data-driven alternative in which guidance emerges from the denoising pace itself, reducing dependence on manually designed priors. Building on diffusion forcing, we factor the representation into modality-specific components and assign individualized noise levels with asynchronous denoising schedules. In this paradigm, cleaner components guide noisier ones through cross-attention, yielding guidance without auxiliary classifiers. We find that this data-driven guidance is inherently contact-aware, and can be enhanced when training is augmented with a broad spectrum of synthetic object geometries, encouraging invariance of contact semantics to geometric diversity. Extensive experiments show that pace-induced guidance more effectively mirrors the benefits of contact priors than conventional classifier-free guidance, while achieving higher contact fidelity, more realistic HOI generation, and stronger generalization to unseen objects and tasks.
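To make the idea of individualized noise levels with asynchronous denoising schedules more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a linear noise schedule, a fixed step offset between components, and two hypothetical component names ("object" and "human"); the abstract specifies none of these, including which component is denoised first. The function name `asynchronous_schedule` and its parameters are likewise illustrative. The sketch only computes the per-step noise levels; in the described paradigm, a denoiser would then let the cleaner component inform the noisier one through cross-attention.

```python
def asynchronous_schedule(modalities, num_steps, lag):
    """Return per-step noise levels in [0, 1] for each modality.

    Assumption (not from the paper): each modality follows a linear
    schedule from 1.0 (pure noise) to 0.0 (clean), and modality i starts
    denoising i * lag steps after modality 0. At every intermediate step,
    earlier modalities are therefore at least as clean as later ones.
    """
    if not modalities:
        raise ValueError("modalities must not be empty")
    if num_steps < 1 or lag < 0:
        raise ValueError("num_steps must be >= 1 and lag must be >= 0")

    # Total number of steps until the last modality is fully denoised.
    total = num_steps + lag * (len(modalities) - 1)

    schedule = []
    for t in range(total + 1):
        levels = {}
        for i, name in enumerate(modalities):
            # Fraction of this modality's own schedule completed at step t,
            # clamped to [0, 1] so it waits before starting and stays clean
            # after finishing.
            progress = (t - i * lag) / num_steps
            progress = min(1.0, max(0.0, progress))
            levels[name] = 1.0 - progress
        schedule.append(levels)
    return schedule


if __name__ == "__main__":
    # Hypothetical ordering: the "object" component leads, "human" trails.
    for step, levels in enumerate(asynchronous_schedule(["object", "human"], 4, 2)):
        print(step, levels)
```

With `lag=0` all components share one noise level, which corresponds to a conventional synchronous schedule; a positive `lag` creates the gap in cleanliness that pace-induced guidance relies on.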