CV · Mar 24, 2025

Human-Object Interaction via Automatically Designed VLM-Guided Motion Policy

arXiv:2503.18349v2 · 7 citations · h-index: 16
Originality: Highly original
AI Analysis

This work addresses scalability and generalizability challenges in animation, simulation, and robotics by automating reward design for human-object interactions.

The paper addresses human-object interaction synthesis without expensive motion capture or manual reward engineering. It introduces a physics-based framework in which Vision-Language Models guide motion policies, and it reports outperforming existing methods at generating natural motions across diverse object types.

Human-object interaction (HOI) synthesis is crucial for applications in animation, simulation, and robotics. However, existing approaches either rely on expensive motion capture data or require manual reward engineering, limiting their scalability and generalizability. In this work, we introduce the first unified physics-based HOI framework that leverages Vision-Language Models (VLMs) to enable long-horizon interactions with diverse object types, including static, dynamic, and articulated objects. We introduce VLM-Guided Relative Movement Dynamics (RMD), a fine-grained spatio-temporal bipartite representation that automatically constructs goal states and reward functions for reinforcement learning. By encoding structured relationships between human and object parts, RMD enables VLMs to generate semantically grounded, interaction-aware motion guidance without manual reward tuning. To support our methodology, we present Interplay, a novel dataset with thousands of long-horizon static and dynamic interaction plans. Extensive experiments demonstrate that our framework outperforms existing methods in synthesizing natural, human-like motions across both simple single-task and complex multi-task scenarios. For more details, please refer to our project webpage: https://vlm-rmd.github.io/.
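To make the idea of a bipartite human-part/object-part representation driving a reward more concrete, here is a minimal sketch. It is not the authors' implementation: the movement labels (`approach`, `hold`, `retreat`), the `Edge` class, the `scale` parameter, and the reward shapes are all assumptions chosen for illustration, and the paper's actual RMD vocabulary and reward terms may differ.

```python
import math
from dataclasses import dataclass

# Hypothetical movement labels; the paper's actual RMD vocabulary may differ.
APPROACH, HOLD, RETREAT = "approach", "hold", "retreat"


@dataclass(frozen=True)
class Edge:
    """One edge of the bipartite graph: a human part paired with an object part."""
    human_part: str
    object_part: str
    movement: str  # desired relative movement for the current stage


def edge_reward(edge, prev_dist, curr_dist, scale=5.0):
    """Score how well the change in part-to-part distance matches the label.

    The reward shapes here are illustrative assumptions, not the paper's terms.
    """
    delta = curr_dist - prev_dist  # positive means the parts moved apart
    if edge.movement == APPROACH:
        return math.tanh(-scale * delta)  # positive when the distance shrinks
    if edge.movement == RETREAT:
        return math.tanh(scale * delta)  # positive when the distance grows
    if edge.movement == HOLD:
        return math.exp(-scale * abs(delta))  # 1.0 when the distance is unchanged
    raise ValueError(f"unknown movement label: {edge.movement}")


def stage_reward(edges, prev_dists, curr_dists):
    """Average the per-edge rewards; distances are keyed by (human_part, object_part)."""
    rewards = []
    for e in edges:
        key = (e.human_part, e.object_part)
        rewards.append(edge_reward(e, prev_dists[key], curr_dists[key]))
    return sum(rewards) / len(rewards)


# Example stage such as a VLM might propose for "sit on the chair" (hypothetical output).
plan = [
    Edge("pelvis", "seat", APPROACH),
    Edge("left_foot", "floor", HOLD),
]
prev = {("pelvis", "seat"): 0.5, ("left_foot", "floor"): 0.0}
curr = {("pelvis", "seat"): 0.3, ("left_foot", "floor"): 0.0}
reward = stage_reward(plan, prev, curr)
```

In this sketch the language model would only need to emit the list of edges for each stage of a plan, and the reward follows mechanically from the labels, which is one way to read the claim that goal states and rewards are constructed without manual tuning.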
