GRS: Generating Robotic Simulation Tasks from Real-World Images
This work addresses the real-to-sim gap for robotics researchers, though it appears incremental because it builds on existing vision-language models and segmentation techniques.
The paper tackles the problem of creating digital twin simulations from real-world images for robot training. It introduces GRS, which uses vision-language models to generate solvable tasks, and reports that the system is effective at object correspondence and task environment generation.
We introduce GRS (Generating Robotic Simulation tasks), a system addressing real-to-sim for robotic simulations. GRS creates digital twin simulations from single RGB-D observations with solvable tasks for virtual agent training. Using vision-language models (VLMs), our pipeline operates in three stages: 1) scene comprehension with SAM2 for segmentation and object description, 2) matching objects with simulation-ready assets, and 3) generating appropriate tasks. We ensure simulation-task alignment through generated test suites and introduce a router that iteratively refines both simulation and test code. Experiments demonstrate our system's effectiveness in object correspondence and task environment generation through our novel router mechanism.
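The router loop described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Candidate`, `refine`, `run_tests`, `route`, `fix_sim`, and `fix_tests` are all hypothetical, and the toy stubs stand in for the VLM calls and the simulator. It also assumes the router repairs either the simulation code or the test code in each iteration, a detail the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Candidate:
    """A generated simulation program paired with its generated test suite."""
    sim_code: str
    test_code: str


def refine(
    candidate: Candidate,
    run_tests: Callable[[Candidate], Tuple[bool, str]],
    route: Callable[[Candidate, str], str],
    fix_sim: Callable[[Candidate, str], str],
    fix_tests: Callable[[Candidate, str], str],
    max_iters: int = 5,
) -> Tuple[Candidate, bool]:
    """Repair simulation or test code until the tests pass or the budget runs out."""
    for _ in range(max_iters):
        passed, log = run_tests(candidate)
        if passed:
            return candidate, True
        # Hypothetical router decision: which artifact is at fault?
        target = route(candidate, log)
        if target == "sim":
            candidate = Candidate(fix_sim(candidate, log), candidate.test_code)
        elif target == "test":
            candidate = Candidate(candidate.sim_code, fix_tests(candidate, log))
        else:
            raise ValueError(f"unknown routing target: {target!r}")
    # Final check after the last repair attempt.
    passed, _ = run_tests(candidate)
    return candidate, passed


# Toy stand-ins (assumptions) for the simulator and the VLM-backed components.
def toy_run_tests(c: Candidate) -> Tuple[bool, str]:
    namespace: dict = {}
    try:
        exec(c.sim_code, namespace)   # "build" the scene
        exec(c.test_code, namespace)  # check it against the task's tests
    except AssertionError as err:
        return False, f"AssertionError: {err}"
    return True, ""


def toy_route(c: Candidate, log: str) -> str:
    return "sim"  # a real router would inspect the log and both programs


def toy_fix_sim(c: Candidate, log: str) -> str:
    return "n_blocks = 3"  # a real fixer would ask a VLM for a patch


def toy_fix_tests(c: Candidate, log: str) -> str:
    return c.test_code


if __name__ == "__main__":
    start = Candidate("n_blocks = 2", "assert n_blocks == 3")
    final, ok = refine(start, toy_run_tests, toy_route, toy_fix_sim, toy_fix_tests)
    print(ok, final.sim_code)
```

Passing the test runner, router, and fixers in as callables keeps the loop independent of any particular model or simulator, so each piece can be swapped or mocked separately.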