Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation
This work addresses generalizable robot manipulation and reduces reliance on large demonstration datasets, though it is incremental in that it builds on existing video-based learning methods.
The paper tackles zero-shot robot manipulation with unseen objects and scenes: it leverages web videos to predict interaction plans and learns a task-agnostic transformation from those plans to robot actions, reporting manipulation results that generalize across tasks, objects, and scenes.
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation: interacting with unseen objects in novel scenes without test-time adaptation. While typical approaches rely on a large amount of demonstration data for such generalization, we propose an approach that leverages web videos to predict plausible interaction plans and learns a task-agnostic transformation to obtain robot actions in the real world. Our framework, Track2Act, predicts tracks of how points in an image should move in future time-steps based on a goal, and can be trained with diverse videos on the web, including those of humans and robots manipulating everyday objects. We use these 2D track predictions to infer a sequence of rigid transforms of the object to be manipulated, and obtain robot end-effector poses that can be executed in an open-loop manner. We then refine this open-loop plan by predicting residual actions through a closed-loop policy trained with a few embodiment-specific demonstrations. We show that combining scalably learned track prediction with a residual policy that requires minimal in-domain robot-specific data enables diverse generalizable robot manipulation, and we present a wide array of real-world robot manipulation results across unseen tasks, objects, and scenes. https://homangab.github.io/track2act/
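
The step of inferring rigid transforms from predicted point tracks can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the tracked points have already been lifted to 3D (for example with a depth image, which the abstract does not specify) and uses the standard Kabsch least-squares fit. The function names `fit_rigid_transform` and `transforms_from_tracks` and the `(T, N, 3)` array layout are hypothetical choices made for this example.

```python
import numpy as np


def fit_rigid_transform(src, dst):
    """Least-squares rigid fit (Kabsch): find R, t with dst ≈ src @ R.T + t.

    src, dst: (N, 3) arrays of corresponding points (assumed already in 3D).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def transforms_from_tracks(tracks_3d):
    """Fit one rigid transform per future step, relative to the first step.

    tracks_3d: (T, N, 3) array, hypothetical layout: T time-steps, N points.
    Returns a list of T - 1 (R, t) pairs.
    """
    first = tracks_3d[0]
    return [fit_rigid_transform(first, frame) for frame in tracks_3d[1:]]
```

Under these assumptions, each `(R, t)` pair could be applied to the initial grasp pose to obtain an open-loop end-effector target for that step; the residual policy described above would then correct those targets in closed loop.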