Instant-Fold: In-Context Imitation Learning for Deformable Object Manipulation

arXiv:2606.04269 · 81.1

Predicted impact: top 17% in RO · last 90 days
Originality: Incremental advance

AI Analysis

This work addresses the challenge of deformable object manipulation for robotics, offering a method that generalizes across diverse folding modes from a single demonstration without retraining.

Instant-Fold enables a robot to manipulate deformable objects (e.g., folding clothes) from a single human demonstration, without gradient updates, by using in-context imitation learning. It achieves zero-shot transfer from simulation to real-world folding tasks.

Deformable object manipulation (DOM) is challenging due to high-dimensional, partially observable states that evolve through long-horizon, topology-changing interactions with multiple valid manipulation modes. We introduce Instant-Fold, an in-context imitation learning framework for DOM. Given a single human demonstration, our policy infers and executes diverse manipulation modes directly from the demonstration, including variations in spatial execution and ordering, without requiring gradient updates. Our approach first learns deformation-aware visual representations via temporal contrastive pretraining, after which a flow-matching transformer policy conditioned on the demonstration predicts actions to execute the intended manipulation mode. Trained entirely in simulation, Instant-Fold generalizes across diverse folding modes and transfers zero-shot to real-world settings without additional data collection or finetuning. Videos are available at https://instant-fold.github.io.
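The abstract names a flow-matching policy but does not state its objective, so the following is only a minimal sketch of generic conditional flow matching, assuming a linear (rectified-flow style) path between Gaussian noise and the target action. All function names are hypothetical, and `predict_velocity` is a stand-in for the demonstration-conditioned transformer, not the authors' model.

```python
import numpy as np


def flow_matching_pair(action, noise, t):
    """Interpolate noise -> action at time t and return the regression target.

    Assumes the linear path x_t = (1 - t) * noise + t * action, whose
    velocity along the path is constant: action - noise.
    """
    x_t = (1.0 - t) * noise + t * action
    target_velocity = action - noise
    return x_t, target_velocity


def flow_matching_loss(predict_velocity, action, context, rng):
    """Mean squared error between predicted and target velocities.

    predict_velocity(x_t, t, context) is a hypothetical callable standing in
    for a policy network conditioned on the demonstration (context).
    """
    noise = rng.standard_normal(action.shape)
    # One time value per batch element, drawn uniformly from [0, 1).
    t = rng.uniform(0.0, 1.0, size=(action.shape[0], 1))
    x_t, target = flow_matching_pair(action, noise, t)
    pred = predict_velocity(x_t, t, context)
    return float(np.mean((pred - target) ** 2))


def sample_action(predict_velocity, context, action_dim, rng, num_steps=10):
    """Generate actions by Euler-integrating the velocity field from t=0 to t=1."""
    x = rng.standard_normal((context.shape[0], action_dim))
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = np.full((x.shape[0], 1), k * dt)
        x = x + dt * predict_velocity(x, t, context)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy check: if the context directly encodes the target action, the exact
    # velocity field for a single target is (target - x) / (1 - t).
    target = np.array([[0.5, -1.0], [2.0, 0.25]])
    oracle = lambda x, t, ctx: (ctx - x) / (1.0 - t)
    print(flow_matching_loss(oracle, target, target, rng))
    print(sample_action(oracle, target, action_dim=2, rng=rng))
```

In this sketch no parameters are updated at inference time: a new demonstration would only change the `context` passed to the velocity predictor, which is the sense in which in-context conditioning avoids gradient updates.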
