CV · Sep 28, 2023

Object Motion Guided Human Motion Synthesis

Stanford
arXiv:2309.16237v1 · 217 citations · h-index: 77
Originality: Incremental advance
AI Analysis

This work addresses a specific challenge in character animation and embodied AI: generating realistic human-object interactions. It is an incremental advance, since it builds on existing diffusion models, though it adds a novel intermediate representation (hand positions).

The paper tackles full-body human motion synthesis for manipulating large objects. It proposes OMOMO, a conditional diffusion framework that generates manipulation behaviors from object motion alone. Explicit contact constraints make the generated motions more physically plausible, and the experiments show that the method is effective and generalizes to unseen objects.

Modeling human behaviors in contextual environments has a wide range of applications in character animation, embodied AI, VR/AR, and robotics. In real-world scenarios, humans frequently interact with the environment and manipulate various objects to complete daily tasks. In this work, we study the problem of full-body human motion synthesis for the manipulation of large-sized objects. We propose Object MOtion guided human MOtion synthesis (OMOMO), a conditional diffusion framework that can generate full-body manipulation behaviors from only the object motion. Since naively applying diffusion models fails to precisely enforce contact constraints between the hands and the object, OMOMO learns two separate denoising processes to first predict hand positions from object motion and subsequently synthesize full-body poses based on the predicted hand positions. By employing the hand positions as an intermediate representation between the two denoising processes, we can explicitly enforce contact constraints, resulting in more physically plausible manipulation motions. With the learned model, we develop a novel system that captures full-body human manipulation motions by simply attaching a smartphone to the object being manipulated. Through extensive experiments, we demonstrate the effectiveness of our proposed pipeline and its ability to generalize to unseen objects. Additionally, as high-quality human-object interaction datasets are scarce, we collect a large-scale dataset consisting of 3D object geometry, object motion, and human motion. Our dataset contains human-object interaction motion for 15 objects, with a total duration of approximately 10 hours.
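The two-stage structure described in the abstract can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function names (`enforce_contact`, `synthesize_frame`), the 5 cm threshold, and the nearest-point snapping rule are all assumptions, since the abstract does not say how the contact constraint is enforced. The two learned denoising processes are replaced by toy stand-in functions.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def nearest_point(p: Point, candidates: Sequence[Point]) -> Point:
    """Return the candidate closest to p (Euclidean distance)."""
    return min(candidates, key=lambda c: math.dist(p, c))


def enforce_contact(hands: List[Point], object_points: Sequence[Point],
                    threshold: float = 0.05) -> List[Point]:
    """Snap a predicted hand onto the object if it is within `threshold`.

    Both the threshold value and the snapping rule are assumptions made
    for illustration; hands farther away are left unchanged.
    """
    corrected = []
    for hand in hands:
        closest = nearest_point(hand, object_points)
        corrected.append(closest if math.dist(hand, closest) <= threshold else hand)
    return corrected


def synthesize_frame(object_points: Sequence[Point],
                     predict_hands: Callable[[Sequence[Point]], List[Point]],
                     predict_body: Callable[[List[Point]], Dict[str, Point]]):
    """Two-stage generation for one frame: object -> hands -> full body."""
    hands = predict_hands(object_points)            # stage 1: hands from object motion
    hands = enforce_contact(hands, object_points)   # explicit contact constraint
    return predict_body(hands)                      # stage 2: body from corrected hands


# Toy stand-ins for the two learned denoising processes (hypothetical).
def toy_hand_model(object_points):
    x, y, z = object_points[0]
    # Left hand 1 cm from the object, right hand far from it.
    return [(x + 0.01, y, z), (x + 1.0, y, z)]


def toy_body_model(hands):
    return {"left_hand": hands[0], "right_hand": hands[1]}


box = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
pose = synthesize_frame(box, toy_hand_model, toy_body_model)
print(pose)
```

In this toy run the left hand starts within the threshold and is snapped onto the object, while the right hand stays where the first stage placed it. Correcting the intermediate hand positions before the second stage is what lets the body pose be conditioned on hands that already satisfy contact.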

