CV · Jun 4

PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding

arXiv:2606.06485
AI Analysis

For embodied AI and 3D scene understanding, this work addresses the limitation of object-centric models by enabling part-level reasoning, which is crucial for fine-grained interaction.

PAR3D introduces a part-aware 3D-MLLM framework that improves fine-grained part understanding in 3D scenes, achieving substantial gains in part-level question answering and referring segmentation while maintaining strong object-level performance.

Recent advances in 3D multimodal large language models (3D-MLLMs) have enabled unified solutions for 3D scene understanding tasks, including visual question answering, captioning, and referring segmentation. However, existing 3D-MLLMs remain largely object-centric, limiting their ability to model fine-grained part structures that are essential for embodied interaction with 3D environments. In this work, we present PAR3D, a unified part-aware 3D-MLLM framework that enables models to understand, reason about, and ground both objects and their parts in 3D scenes. To enable training and evaluation of part-aware 3D scene understanding, we introduce ScenePart, a synthetic 3D scene dataset with part-level annotations and language instructions. We further develop Part-Aware 3D Representation Learning to enrich 3D visual representations with fine-grained part-level semantics, and propose Hierarchical Segmentation Query Generation to ground part targets via hierarchical object-part queries. Extensive experiments show that our method substantially improves part-level question answering and referring segmentation, while also achieving strong performance across object-level vision-language tasks.
