RO, AI | Nov 2, 2025

URDF-Anything: Constructing Articulated Objects with 3D Multimodal Language Model

arXiv:2511.00940v1 | 11 citations | h-index: 4
Originality: Incremental advance
AI Analysis

This work provides an efficient solution for robotic simulation and embodied AI, enabling better sim-to-real transfer, though it appears incremental as it builds on existing multimodal and segmentation methods.

The paper tackles the problem of automatically constructing digital twins of articulated objects for robotic simulation by proposing URDF-Anything, an end-to-end framework based on a 3D multimodal language model, which achieves improvements such as a 17% increase in mIoU for geometric segmentation and a 29% reduction in kinematic parameter error.
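The mIoU figure quoted above is a mean intersection-over-union across part labels. A minimal sketch of one common way to compute it for per-point part segmentation follows. The function name `part_miou` and the toy labels are illustrative assumptions; the paper's exact evaluation protocol (for example, how predicted parts are matched to ground-truth parts) is not stated here and may differ.

```python
def part_miou(pred, gt):
    """Mean IoU over every part label that occurs in pred or gt.

    pred and gt are equal-length sequences holding one part label per point.
    This is a generic definition, not necessarily the paper's protocol.
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")
    ious = []
    for label in sorted(set(pred) | set(gt)):
        # Points that both assign to this part.
        inter = sum(1 for p, g in zip(pred, gt) if p == label and g == label)
        # Points that either assigns to this part (never zero for a seen label).
        union = sum(1 for p, g in zip(pred, gt) if p == label or g == label)
        ious.append(inter / union)
    return sum(ious) / len(ious)


# Toy example: six points, two parts, one point mislabeled.
gt_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 1, 1]
score = part_miou(pred_labels, gt_labels)
```

In the toy example, part 0 has IoU 2/3 and part 1 has IoU 3/4, so the mean is their average.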

Constructing accurate digital twins of articulated objects is essential for robotic simulation training and embodied AI world model building, yet historically requires painstaking manual modeling or multi-stage pipelines. In this work, we propose URDF-Anything, an end-to-end automatic reconstruction framework based on a 3D multimodal large language model (MLLM). URDF-Anything utilizes an autoregressive prediction framework based on point-cloud and text multimodal input to jointly optimize geometric segmentation and kinematic parameter prediction. It implements a specialized [SEG] token mechanism that interacts directly with point cloud features, enabling fine-grained part-level segmentation while maintaining consistency with the kinematic parameter predictions. Experiments on both simulated and real-world datasets demonstrate that our method significantly outperforms existing approaches regarding geometric segmentation (mIoU 17% improvement), kinematic parameter prediction (average error reduction of 29%), and physical executability (surpassing baselines by 50%). Notably, our method exhibits excellent generalization ability, performing well even on objects outside the training set. This work provides an efficient solution for constructing digital twins for robotic simulation, significantly enhancing the sim-to-real transfer capability.
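To make the target representation concrete, the sketch below assembles a small URDF document from a set of kinematic parameters of the kind the abstract mentions (joint type, origin, axis, limits). It is an illustration under stated assumptions, not the paper's pipeline: the helper `build_urdf`, the dictionary keys, the cabinet example, and the placeholder `effort` and `velocity` values are all hypothetical. Only the element and attribute names follow the standard URDF XML format.

```python
import xml.etree.ElementTree as ET


def _vec(values):
    """Format a 3-vector as the space-separated string URDF expects."""
    return " ".join(str(v) for v in values)


def build_urdf(name, links, joints):
    """Serialize link names and joint parameters into a URDF string.

    `joints` is a list of dicts with hypothetical keys: name, type,
    parent, child, origin, axis, limit. Geometry (meshes) is omitted.
    """
    robot = ET.Element("robot", name=name)
    for link in links:
        ET.SubElement(robot, "link", name=link)
    for j in joints:
        joint = ET.SubElement(robot, "joint", name=j["name"], type=j["type"])
        ET.SubElement(joint, "parent", link=j["parent"])
        ET.SubElement(joint, "child", link=j["child"])
        ET.SubElement(joint, "origin", xyz=_vec(j["origin"]), rpy="0 0 0")
        ET.SubElement(joint, "axis", xyz=_vec(j["axis"]))
        lower, upper = j["limit"]
        # effort and velocity are placeholder values, not predicted quantities.
        ET.SubElement(joint, "limit", lower=str(lower), upper=str(upper),
                      effort="10", velocity="1")
    return ET.tostring(robot, encoding="unicode")


# Hypothetical prediction: a cabinet door hinged about the vertical axis.
urdf_text = build_urdf(
    "cabinet",
    links=["base", "door"],
    joints=[{
        "name": "door_hinge",
        "type": "revolute",
        "parent": "base",
        "child": "door",
        "origin": (0.3, 0.0, 0.0),
        "axis": (0.0, 0.0, 1.0),
        "limit": (0.0, 1.57),
    }],
)
```

A complete URDF for simulation would also attach the segmented part geometry to each link as visual and collision meshes, which this sketch leaves out.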

