KINet: Unsupervised Forward Models for Robotic Pushing Manipulation
This work addresses the challenge of robotic manipulation in unstructured environments by learning physical interactions without supervision, though it is an incremental improvement over existing unsupervised methods.
The paper tackles the problem of learning object-centric forward models for robotic pushing without supervision. It introduces KINet, which learns keypoint representations and predicts future keypoint states, and it reports accurate forward prediction and generalization to new scenarios.
Object-centric representation is an essential abstraction for forward prediction. Most existing forward models learn this representation through extensive supervision (e.g., object class and bounding box), although such ground-truth information is not readily accessible in reality. To address this, we introduce KINet (Keypoint Interaction Network), an end-to-end unsupervised framework to reason about object interactions based on a keypoint representation. Using visual observations, our model learns to associate objects with keypoint coordinates and discovers a graph representation of the system as a set of keypoint embeddings and their relations. It then learns an action-conditioned forward model using contrastive estimation to predict future keypoint states. By learning to perform physical reasoning in the keypoint space, our model automatically generalizes to scenarios with a different number of objects, novel backgrounds, and unseen object geometries. Experiments demonstrate the effectiveness of our model in accurately performing forward prediction and learning plannable object-centric representations for downstream robotic pushing manipulation tasks.
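To make two of the ingredients above concrete, the following is a minimal NumPy sketch, not the KINet implementation. It assumes that keypoint coordinates are read out of per-keypoint heatmaps with a spatial soft-argmax, and that the contrastive objective is an InfoNCE-style loss with negative squared distance as the similarity. Both choices, along with the function names, the temperature value, and the array shapes, are illustrative assumptions that the abstract does not specify.

```python
import numpy as np


def soft_argmax_keypoints(heatmaps):
    """Map (K, H, W) unnormalized heatmaps to (K, 2) (x, y) coordinates in [-1, 1].

    Assumption: a spatial soft-argmax readout; the abstract does not say how
    keypoint coordinates are extracted.
    """
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    flat = flat - flat.max(axis=1, keepdims=True)  # subtract max for numerical stability
    probs = np.exp(flat)
    probs /= probs.sum(axis=1, keepdims=True)
    probs = probs.reshape(k, h, w)
    xs = np.linspace(-1.0, 1.0, w)
    ys = np.linspace(-1.0, 1.0, h)
    x = (probs.sum(axis=1) * xs).sum(axis=1)  # expectation over columns
    y = (probs.sum(axis=2) * ys).sum(axis=1)  # expectation over rows
    return np.stack([x, y], axis=1)


def info_nce_loss(predicted, target, temperature=0.1):
    """InfoNCE-style loss between predicted and observed next states, both (B, D).

    Assumption: row i of `predicted` is the positive match for row i of `target`,
    and every other row in the batch serves as a negative.
    """
    diff = predicted[:, None, :] - target[None, :, :]
    logits = -np.sum(diff ** 2, axis=-1) / temperature  # similarity = -squared distance
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

In this sketch a forward model would produce `predicted` from the current keypoint states and the push action, and the loss rewards predictions that lie closer to their own observed next state than to the other states in the batch.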