Articulate3D: Holistic Understanding of 3D Scenes as Universal Scene Description
This work addresses the underrepresented task of understanding interactable and articulated objects in 3D scenes, with applications in mixed reality, wearable computing, and embodied AI. The contribution may appear incremental, since it builds on existing datasets and methods.
The authors introduce Articulate3D, a dataset of 280 indoor scenes with 8 annotation types, and USDNet, a unified framework that predicts part segmentation and motion attributes. Evaluations on multiple datasets show the advantage of the approach, and downstream applications include scene editing and robotic policy training.
3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI. Providing a solution to these applications requires a multifaceted approach that covers scene-centric, object-centric, as well as interaction-centric capabilities. While there exist numerous datasets and algorithms approaching the former two problems, the task of understanding interactable and articulated objects is underrepresented and only partly covered in the research field. In this work, we address this shortcoming by introducing: (1) Articulate3D, an expertly curated 3D dataset featuring high-quality manual annotations on 280 indoor scenes. Articulate3D provides 8 types of annotations for articulated objects, covering parts and detailed motion information, all stored in a standardized scene representation format designed for scalable 3D content creation, exchange and seamless integration into simulation environments. (2) USDNet, a novel unified framework capable of simultaneously predicting part segmentation along with a full specification of motion attributes for articulated objects. We evaluate USDNet on Articulate3D as well as two existing datasets, demonstrating the advantage of our unified dense prediction approach. Furthermore, we highlight the value of Articulate3D through cross-dataset and cross-domain evaluations and showcase its applicability in downstream tasks such as scene editing through LLM prompting and robotic policy training for articulated object manipulation. We provide open access to our dataset, benchmark, and method's source code.
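To illustrate what "motion information" for an articulated part can look like, here is a minimal sketch. It assumes a common parameterization: a motion type, an axis, and an origin. The abstract does not enumerate Articulate3D's 8 annotation types, so `PartMotion`, `move_point`, and their fields are hypothetical names chosen for this example. They are not the dataset's schema and not USDNet's output format.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class PartMotion:
    """Hypothetical motion record for one movable part (not the dataset schema)."""
    kind: str      # "rotation" (e.g. a door hinge) or "translation" (e.g. a drawer)
    axis: Vec3     # direction of the motion axis, need not be unit length
    origin: Vec3   # a point on the axis; only used for rotations


def _normalize(v: Vec3) -> Vec3:
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("axis must be non-zero")
    return (v[0] / n, v[1] / n, v[2] / n)


def move_point(p: Vec3, motion: PartMotion, amount: float) -> Vec3:
    """Move a point of the part by `amount` (radians for rotation, scene units for translation)."""
    k = _normalize(motion.axis)
    if motion.kind == "translation":
        return (p[0] + amount * k[0], p[1] + amount * k[1], p[2] + amount * k[2])
    if motion.kind == "rotation":
        # Rodrigues' rotation formula about the axis through `origin`.
        v = (p[0] - motion.origin[0], p[1] - motion.origin[1], p[2] - motion.origin[2])
        c, s = math.cos(amount), math.sin(amount)
        cross = (k[1] * v[2] - k[2] * v[1],
                 k[2] * v[0] - k[0] * v[2],
                 k[0] * v[1] - k[1] * v[0])
        dot = k[0] * v[0] + k[1] * v[1] + k[2] * v[2]
        r = tuple(v[i] * c + cross[i] * s + k[i] * dot * (1.0 - c) for i in range(3))
        return (motion.origin[0] + r[0], motion.origin[1] + r[1], motion.origin[2] + r[2])
    raise ValueError(f"unknown motion kind: {motion.kind}")


# Usage: a door point swinging a quarter turn about a vertical hinge at the origin.
hinge = PartMotion(kind="rotation", axis=(0.0, 0.0, 1.0), origin=(0.0, 0.0, 0.0))
opened = move_point((1.0, 0.0, 0.0), hinge, math.pi / 2)
```

A record of this kind, attached to each segmented part, would be enough to animate the part in a simulator. Whether the released annotations use this exact parameterization should be checked against the dataset itself.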