CV · Apr 8, 2024

GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts

Zoltán Á. Milacski, Koichiro Niinuma, Ryosuke Kawamura, Fernando de la Torre, László A. Jeni

arXiv:2405.18438v16.53 citations · h-index: 29 · WACV

Originality: Highly original

AI Analysis

This work addresses the challenge of accurately grounding generated motion in multimodal (scene and text) contexts, with applications such as robotics and animation. It represents a strong, specific gain rather than a foundational advance.

The paper tackles human motion generation grounded in 3D scenes and text by integrating an open-vocabulary scene encoder, achieving up to a 30% reduction in goal object distance compared to the prior state-of-the-art baseline on the HUMANISE dataset.

The connection between our 3D surroundings and the descriptive language that characterizes them would be well-suited for localizing and generating human motion in context but for one problem. The complexity introduced by multiple modalities makes capturing this connection challenging with a fixed set of descriptors. Specifically, closed vocabulary scene encoders, which require learning text-scene associations from scratch, have been favored in the literature, often resulting in inaccurate motion grounding. In this paper, we propose a method that integrates an open vocabulary scene encoder into the architecture, establishing a robust connection between text and scene. Our two-step approach starts with pretraining the scene encoder through knowledge distillation from an existing open vocabulary semantic image segmentation model, ensuring a shared text-scene feature space. Subsequently, the scene encoder is fine-tuned for conditional motion generation, incorporating two novel regularization losses that regress the category and size of the goal object. Our methodology achieves up to a 30% reduction in the goal object distance metric compared to the prior state-of-the-art baseline model on the HUMANISE dataset. This improvement is demonstrated through evaluations conducted using three implementations of our framework and a perceptual study. Additionally, our method is designed to seamlessly accommodate future 2D segmentation methods that provide per-pixel text-aligned features for distillation.
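
The two training steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not state the exact loss forms, so the cosine distance used for distillation, the cross-entropy and mean-squared-error regularizers, and all function and parameter names below are assumptions chosen for illustration.

```python
import numpy as np


def cosine_distillation_loss(student_feats, teacher_feats, eps=1e-8):
    """Mean (1 - cosine similarity) between matching rows.

    student_feats: (N, D) per-point features from the scene encoder (assumed shape).
    teacher_feats: (N, D) text-aligned features from a 2D segmentation
                   teacher, lifted to the same N points (assumed shape).
    """
    s = student_feats / (np.linalg.norm(student_feats, axis=1, keepdims=True) + eps)
    t = teacher_feats / (np.linalg.norm(teacher_feats, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))


def goal_regularizers(class_logits, size_pred, class_target, size_target):
    """Hypothetical auxiliary losses for the fine-tuning step.

    Returns (category_loss, size_loss): a cross-entropy on the goal
    object's category and a mean squared error on its size.
    """
    shifted = class_logits - class_logits.max()          # numerically stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum())
    category_loss = -float(log_probs[class_target])
    size_loss = float(np.mean((size_pred - size_target) ** 2))
    return category_loss, size_loss
```

In this sketch the distillation loss would be minimized during pretraining, and the two regularizers would be added, with some weighting, to the motion generation objective during fine-tuning.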
