RO · CV · LG · Sep 24, 2025

VisualMimic: Visual Humanoid Loco-Manipulation via Motion Tracking and Generation

arXiv:2509.20322v2 · 33 citations · h-index: 15
Originality: Incremental advance
AI Analysis

The paper addresses the problem of generalizing humanoid robot control across diverse tasks without external motion capture. The advance appears incremental, since it builds on existing sim-to-real and hierarchical control methods.

The paper tackles humanoid robot loco-manipulation in unstructured environments by introducing VisualMimic, a visual sim-to-real framework that achieves zero-shot transfer to real robots, enabling tasks like box lifting and football dribbling in both lab and outdoor settings.

Humanoid loco-manipulation in unstructured environments demands tight integration of egocentric perception and whole-body control. However, existing approaches either depend on external motion capture systems or fail to generalize across diverse tasks. We introduce VisualMimic, a visual sim-to-real framework that unifies egocentric vision with hierarchical whole-body control for humanoid robots. VisualMimic combines a task-agnostic low-level keypoint tracker -- trained from human motion data via a teacher-student scheme -- with a task-specific high-level policy that generates keypoint commands from visual and proprioceptive input. To ensure stable training, we inject noise into the low-level policy and clip high-level actions using human motion statistics. VisualMimic enables zero-shot transfer of visuomotor policies trained in simulation to real humanoid robots, accomplishing a wide range of loco-manipulation tasks such as box lifting, pushing, football dribbling, and kicking. Beyond controlled laboratory settings, our policies also generalize robustly to outdoor environments. Videos are available at: https://visualmimic.github.io .
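The two stabilization steps mentioned in the abstract (clipping high-level actions using human motion statistics, and injecting noise into the low-level policy) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the function names, the mean ± k·std form of the clipping bound, the choice to add Gaussian noise to the commands the low-level tracker receives, and the toy three-dimensional command are all hypothetical.

```python
import random
from typing import List, Sequence


def clip_to_human_range(
    command: Sequence[float],
    mean: Sequence[float],
    std: Sequence[float],
    k: float = 3.0,
) -> List[float]:
    """Clamp each command dimension to [mean - k*std, mean + k*std].

    Assumption: the "human motion statistics" are per-dimension means and
    standard deviations; the paper may define the bound differently.
    """
    clipped = []
    for c, m, s in zip(command, mean, std):
        low, high = m - k * s, m + k * s
        clipped.append(min(max(c, low), high))
    return clipped


def add_command_noise(
    command: Sequence[float], sigma: float, rng: random.Random
) -> List[float]:
    """Add zero-mean Gaussian noise to a command.

    Assumption: noise is applied to the keypoint commands fed to the
    low-level tracker so that it tolerates imperfect high-level outputs.
    """
    return [c + rng.gauss(0.0, sigma) for c in command]


# Toy statistics for a 3-dimensional keypoint command (made-up numbers).
human_mean = [0.0, 0.5, 1.0]
human_std = [0.1, 0.2, 0.05]

# A raw high-level action; dimensions 0 and 2 fall outside the human range.
raw_command = [1.0, 0.4, 0.7]

safe_command = clip_to_human_range(raw_command, human_mean, human_std, k=3.0)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
noisy_command = add_command_noise(safe_command, sigma=0.01, rng=rng)
```

In this sketch, clipping keeps the high-level policy from requesting keypoint targets far outside what the tracker saw in human motion data, while the noise exposes the tracker to slightly perturbed commands. Both are plausible readings of the abstract rather than confirmed details.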

