Dense Object Grounding in 3D Scenes
This work addresses a practical limitation in 3D object grounding for applications like robotics and autonomous driving by enabling more contextualized, multi-object localization from complex descriptions.
The paper tackles the problem of localizing multiple objects in a 3D scene from a paragraph-level natural language description rather than a single sentence, and introduces this as a new task called 3D Dense Object Grounding (3D DOG). It proposes a Stacked Transformer framework (3DOGSFormer) that exploits the semantic and spatial relationships among the described objects, and reports that it outperforms state-of-the-art 3D single-object grounding methods and their dense-object variants by significant margins on the Nr3D, Sr3D, and ScanRefer benchmarks.
Localizing objects in 3D scenes according to the semantics of a given natural language description is a fundamental and important task in multimedia understanding, and it benefits various real-world applications such as robotics and autonomous driving. However, the majority of existing 3D object grounding methods are restricted to a single-sentence input describing an individual object, and so cannot comprehend and reason about the more contextualized descriptions of multiple objects that arise in more practical 3D cases. To this end, we introduce a new challenging task, called 3D Dense Object Grounding (3D DOG), to jointly localize multiple objects described in a more complicated paragraph rather than a single sentence. Rather than naively localizing each sentence-guided object independently, we observe that dense objects described in the same paragraph are often semantically related and spatially located in a focused region of the 3D scene. To exploit these semantic and spatial relationships among densely referred objects for more accurate localization, we propose a novel Stacked Transformer-based framework for 3D DOG, named 3DOGSFormer. Specifically, we first devise a contextual query-driven local transformer decoder to generate an initial grounding proposal for each target object. We then employ a proposal-guided global transformer decoder that exploits the local object features to learn their correlations and further refine the initial grounding proposals. Extensive experiments on three challenging benchmarks (Nr3D, Sr3D, and ScanRefer) show that our proposed 3DOGSFormer outperforms state-of-the-art 3D single-object grounding methods and their dense-object variants by significant margins.
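The local-then-global data flow described above can be illustrated with a minimal, untrained sketch. It is not the 3DOGSFormer implementation: it assumes each sentence and each scene object is already encoded as a plain feature vector, it uses bare scaled dot-product attention with no learned projections, and it stops at refined features rather than predicting boxes. All names (`attend`, `local_stage`, `global_stage`, `ground_paragraph`) are hypothetical and chosen only for illustration.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention for one query (no learned weights)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def local_stage(sentence_queries: List[Vector],
                scene_features: List[Vector]) -> List[Vector]:
    """Assumed stand-in for the local decoder: each sentence query
    attends over the scene on its own to form an initial proposal."""
    return [attend(q, scene_features, scene_features) for q in sentence_queries]


def global_stage(proposals: List[Vector]) -> List[Vector]:
    """Assumed stand-in for the global decoder: each proposal attends
    over all proposals from the same paragraph, with a residual add."""
    refined = []
    for proposal in proposals:
        context = attend(proposal, proposals, proposals)
        refined.append([p + c for p, c in zip(proposal, context)])
    return refined


def ground_paragraph(sentence_queries: List[Vector],
                     scene_features: List[Vector]) -> List[Vector]:
    """Stack the two stages: per-sentence proposals, then joint refinement."""
    return global_stage(local_stage(sentence_queries, scene_features))


if __name__ == "__main__":
    # Toy inputs: three scene objects and a two-sentence paragraph.
    scene = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    paragraph = [[1.0, 0.0], [0.0, 1.0]]
    for refined in ground_paragraph(paragraph, scene):
        print(refined)
```

The point of the second stage is that each sentence's result depends on the other sentences in the paragraph. The naive per-sentence baseline mentioned in the abstract corresponds to running `local_stage` alone.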