RO · May 17

Real2Sim via Active Perception with Behavior Trees Automatically Generated by VLMs

arXiv:2601.08454 · 38.11 citations · h-index: 4
Predicted impact top 55% in RO · last 90 days · Originality: Incremental advance
AI Analysis

For robotics and simulation, this work provides a scalable, interpretable pipeline for building physics-aware digital twins from unstructured human intent while reducing redundant physical interactions.

This work presents an autonomous Real2Sim framework that uses VLMs to decompose high-level tasks and generate Behavior Trees for selectively acquiring missing physical parameters via robotic interaction. Experiments on a Franka Emika Panda show accurate estimation of mass, geometry, and friction with significant efficiency gains over exhaustive baselines.

Constructing physically accurate simulation environments (Real2Sim) traditionally relies on manual system identification or rigid, exhaustive exploration routines. These task-agnostic pipelines often fail to leverage semantic scene context, leading to redundant physical interactions and inefficient data acquisition. In this paper, we present an autonomous, intent-driven Real2Sim framework that leverages Vision-Language Models (VLMs) for Semantic Task Decomposition. Given a high-level natural language request, an incomplete simulation description, and a visual observation, the framework autonomously identifies the minimal subset of missing physical parameters required for the simulation task. It then generates a reactive Behavior Tree (BT) composed of atomic motion and sensing primitives to selectively acquire these parameters through contact-rich robotic interaction. Extensive real-world experiments on a torque-controlled Franka Emika Panda demonstrate that our approach accurately estimates object mass, surface geometry, and derived parameters such as friction. Quantitative evaluations reveal significant operational efficiency gains compared to exhaustive baseline methods, while ablation studies confirm the robustness of the prompt architecture across different state-of-the-art VLMs. Furthermore, the reactive hierarchy of the BT acts as a deterministic safety filter, successfully mitigating generative VLM hallucinations and preventing unsafe physical anomalies. Ultimately, this work provides a scalable, efficient, and interpretable pipeline for building physics-aware digital twins directly from unstructured human intent.
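The pipeline described in the abstract (find the missing parameters, then run a reactive tree of guarded primitives to acquire them) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the `REQUIRED` lookup table stands in for the VLM's semantic task decomposition, the blackboard keys (`lift_force_n`, `push_force_n`, `sensed_force_n`, `force_limit_n`) are hypothetical sensor readings, and friction is derived with a simple Coulomb model, mu = F_push / (m g), for steady sliding on a horizontal surface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

SUCCESS, FAILURE = "SUCCESS", "FAILURE"
G = 9.81  # gravitational acceleration in m/s^2

# Hypothetical stand-in for the VLM: parameters each simulation task needs.
REQUIRED: Dict[str, List[str]] = {
    "lift": ["mass"],
    "push": ["mass", "friction"],
}


def missing_parameters(task: str, sim: Dict[str, Optional[float]]) -> List[str]:
    """Return only the required parameters that the simulation description lacks."""
    return [name for name in REQUIRED[task] if sim.get(name) is None]


@dataclass
class Leaf:
    """Condition or action node: wraps a function of the blackboard."""
    name: str
    run: Callable[[dict], str]

    def tick(self, bb: dict) -> str:
        return self.run(bb)


@dataclass
class Sequence:
    """Ticks children in order and fails as soon as one child fails."""
    children: List[object]

    def tick(self, bb: dict) -> str:
        for child in self.children:
            if child.tick(bb) != SUCCESS:
                return FAILURE
        return SUCCESS


def force_within_limit(bb: dict) -> str:
    # Deterministic guard: block the following action if the sensed force is too high.
    return SUCCESS if abs(bb["sensed_force_n"]) <= bb["force_limit_n"] else FAILURE


def estimate_mass(bb: dict) -> str:
    # Assumed model: static lift, so the measured vertical force equals m * g.
    bb["sim"]["mass"] = bb["lift_force_n"] / G
    return SUCCESS


def estimate_friction(bb: dict) -> str:
    # Assumed Coulomb model: mu = tangential force / (m * g); needs a known mass.
    mass = bb["sim"].get("mass")
    if mass is None:
        return FAILURE
    bb["sim"]["friction"] = bb["push_force_n"] / (mass * G)
    return SUCCESS


PRIMITIVES = {"mass": estimate_mass, "friction": estimate_friction}


def build_tree(missing: List[str]) -> Sequence:
    """Guard every sensing primitive with the safety condition."""
    steps = [
        Sequence([Leaf("force_within_limit", force_within_limit),
                  Leaf(f"estimate_{name}", PRIMITIVES[name])])
        for name in missing
    ]
    return Sequence(steps)
```

Calling `build_tree(missing_parameters("push", sim)).tick(bb)` on a blackboard whose `sim` entry lacks mass and friction fills in both values, while a `sensed_force_n` above `force_limit_n` makes the tree return `FAILURE` before any primitive runs. Because the guard is an ordinary tree node evaluated on every tick, an ill-formed plan cannot bypass it, which is the sense in which a Behavior Tree can act as a deterministic filter in this sketch.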

