Dex4D: Task-Agnostic Point Track Policy for Sim-to-Real Dexterous Manipulation
Dex4D addresses the problem of scalable dexterous manipulation by reducing the need for expensive real-world data and task-specific simulation designs.
The paper tackles the challenge of learning generalist policies for dexterous manipulation. It proposes Dex4D, a framework that learns a task-agnostic policy in simulation for manipulating objects to any desired pose; the policy transfers zero-shot to real-world tasks without finetuning. In simulation and real-robot experiments, the method shows consistent improvements over prior baselines and strong generalization to novel objects, scene layouts, backgrounds, and trajectories.
Learning generalist policies capable of accomplishing a plethora of everyday tasks remains an open challenge in dexterous manipulation. In particular, collecting large-scale manipulation data via real-world teleoperation is expensive and difficult to scale. While learning in simulation provides a feasible alternative, designing multiple task-specific environments and rewards for training is similarly challenging. We propose Dex4D, a framework that instead leverages simulation for learning task-agnostic dexterous skills that can be flexibly recomposed to perform diverse real-world manipulation tasks. Specifically, Dex4D learns a domain-agnostic 3D point track conditioned policy capable of manipulating any object to any desired pose. We train this 'Anypose-to-Anypose' policy in simulation across thousands of objects with diverse pose configurations, covering a broad space of robot-object interactions that can be composed at test time. At deployment, this policy can be zero-shot transferred to real-world tasks without finetuning, simply by prompting it with desired object-centric point tracks extracted from generated videos. During execution, Dex4D uses online point tracking for closed-loop perception and control. Extensive experiments in simulation and on real robots show that our method enables zero-shot deployment for diverse dexterous manipulation tasks and yields consistent improvements over prior baselines. Furthermore, we demonstrate strong generalization to novel objects, scene layouts, backgrounds, and trajectories, highlighting the robustness and scalability of the proposed framework.
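To make the idea of prompting a policy with object-centric point tracks more concrete, here is a minimal sketch of how a desired object pose could be expressed as a set of goal 3D points. It is not the authors' implementation. The function names (`goal_points`, `track_error`, `policy_condition`), the restriction to a z-axis rotation, and the choice of "current point plus offset to goal" as the conditioning signal are all assumptions made for illustration. The sketch uses only the Python standard library.

```python
import math

def rotate_z(point, angle_rad):
    """Rotate a 3D point about the z-axis by angle_rad."""
    x, y, z = point
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x - s * y, s * x + c * y, z)

def goal_points(points, angle_rad, translation):
    """Move object points by a rigid transform (z-rotation, then translation)."""
    tx, ty, tz = translation
    moved = []
    for p in points:
        x, y, z = rotate_z(p, angle_rad)
        moved.append((x + tx, y + ty, z + tz))
    return moved

def track_error(current, goal):
    """Mean Euclidean distance between corresponding current and goal points."""
    return sum(math.dist(p, q) for p, q in zip(current, goal)) / len(current)

def policy_condition(current, goal):
    """Hypothetical conditioning: each current point joined with its offset to the goal."""
    return [p + tuple(g - c for c, g in zip(p, q)) for p, q in zip(current, goal)]

# Three points sampled on an object (metres), in its current pose.
current = [(0.05, 0.0, 0.0), (0.0, 0.05, 0.0), (0.0, 0.0, 0.05)]
# Desired pose: quarter turn about z, then a shift of 10 cm in x and 20 cm in z.
goal = goal_points(current, math.pi / 2, (0.1, 0.0, 0.2))
condition = policy_condition(current, goal)  # one 6-value row per tracked point
error = track_error(current, goal)           # a closed loop would re-measure this each step
```

In a closed-loop setting, `current` would be refreshed from an online point tracker at every control step, so the offsets shrink as the object approaches the target pose. How Dex4D actually encodes the tracks for its policy is not specified in the abstract.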