RO · Mar 13

AOMGen: Photoreal, Physics-Consistent Demonstration Generation for Articulated Object Manipulation

arXiv:2512.18396 77.51 citations
AI Analysis

This paper addresses the need for scalable, low-cost demonstration data in robotics, particularly for fine-grained manipulation tasks. The contribution is incremental, since it builds on existing VLA and world-model methods.

The paper tackles the problem of generating photorealistic, physics-consistent training data for articulated object manipulation, which traditionally requires costly real demonstrations. It presents AOMGen, a framework that synthesizes diverse data from a single real scan, increasing VLA policy success rates from 0% to 88.7% on unseen objects and layouts.

Recent advances in Vision-Language-Action (VLA) and world-model methods have improved generalization in tasks such as robotic manipulation and object interaction. However, successful execution of such tasks depends on large, costly collections of real demonstrations, especially for fine-grained manipulation of articulated objects. To address this, we present AOMGen, a scalable data generation framework for articulated manipulation that is instantiated from a single real scan, a demonstration, and a library of readily available digital assets, yielding photoreal training data with verified physical states. The framework synthesizes synchronized multi-view RGB temporally aligned with action commands and state annotations for joints and contacts, and systematically varies camera viewpoints, object styles, and object poses to expand a single execution into a diverse corpus. Experimental results demonstrate that fine-tuning VLA policies on AOMGen data increases the success rate from 0% to 88.7% when the policies are tested on unseen objects and layouts.
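
To make the idea of expanding a single execution into a diverse corpus concrete, the following is a minimal sketch that enumerates combinations of the three variation axes the abstract names (camera viewpoints, object styles, object poses). It is not AOMGen's implementation: the `Variation` class, the `expand_demo` function, and all example values are hypothetical, and the abstract does not say whether variations are enumerated exhaustively or sampled.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Variation:
    """One rendering configuration for a replay of the source demonstration.

    All fields are hypothetical placeholders, not AOMGen's actual schema.
    """
    camera: str            # name of a camera viewpoint
    style: str             # name of an object appearance/style asset
    pose_offset_cm: tuple  # (dx, dy) shift applied to the object pose


def expand_demo(cameras, styles, pose_offsets):
    """Return every combination of the three variation axes."""
    return [
        Variation(camera, style, offset)
        for camera, style, offset in product(cameras, styles, pose_offsets)
    ]


variants = expand_demo(
    cameras=["front", "left", "wrist"],
    styles=["oak", "steel"],
    pose_offsets=[(0, 0), (5, 0), (0, 5)],
)
print(len(variants))  # 3 cameras * 2 styles * 3 poses = 18
```

In this sketch one demonstration yields 18 configurations. A real pipeline would additionally have to re-simulate each configuration and check its physical states before rendering, which is outside the scope of this example.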