Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention
This work addresses the challenge of locating 3D objects from language queries, with applications in visual understanding and robotics, and represents an incremental improvement over existing methods.
The paper tackles multi-object 3D grounding from point clouds by introducing D-LISA, a two-stage method with dynamic modules and language-informed spatial attention, which outperforms state-of-the-art methods on multi-object grounding by 12.8% (absolute).
Multi-object 3D grounding involves locating 3D boxes in a point cloud based on a given query phrase. It is a challenging and significant task with numerous applications in visual understanding, human-computer interaction, and robotics. To tackle this challenge, we introduce D-LISA, a two-stage approach incorporating three innovations. First, a dynamic vision module enables a variable and learnable number of box proposals. Second, dynamic camera positioning extracts features for each proposal. Third, a language-informed spatial attention module reasons more effectively over the proposals to output the final prediction. Experiments show that our method outperforms state-of-the-art methods on multi-object 3D grounding by 12.8% (absolute) and is competitive in single-object 3D grounding.
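
To make the idea of spatial attention over box proposals more concrete, here is a minimal sketch in plain Python. It is not the D-LISA implementation: the abstract does not specify how the attention is computed, so the distance penalty, the scalar `lang_gate` (a hypothetical value in [0, 1] standing in for a signal derived from the query), and the helper names `spatial_attention` and `select_boxes` are all assumptions made for illustration. The second helper only illustrates that multi-object grounding returns a variable number of boxes, here by simple score thresholding.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention(features, centers, lang_gate):
    """Attention weights between proposals, biased by spatial distance.

    features:  one feature vector (list of floats) per proposal.
    centers:   one (x, y, z) box center per proposal.
    lang_gate: hypothetical scalar, assumed to come from the language
               query, that scales how strongly distance matters.
    Returns a row-stochastic matrix (list of lists) of attention weights.
    """
    d = len(features[0])
    weights = []
    for i, q in enumerate(features):
        logits = []
        for j, k in enumerate(features):
            # Scaled dot-product similarity between proposal features.
            sim = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            # Assumed spatial bias: farther proposals are penalized more
            # when the language gate is large.
            dist = math.dist(centers[i], centers[j])
            logits.append(sim - lang_gate * dist)
        weights.append(softmax(logits))
    return weights


def select_boxes(scores, threshold=0.5):
    """Return indices of all proposals whose score passes the threshold.

    The number of returned boxes varies with the query, which is what
    distinguishes multi-object from single-object grounding.
    """
    return [i for i, s in enumerate(scores) if s >= threshold]
```

With `lang_gate = 0` the weights depend only on feature similarity; raising it shifts attention toward spatially nearby proposals, which is one simple way a query containing a relation such as "next to" could influence the reasoning step.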