CV · Oct 21, 2024

Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding

arXiv:2410.15615v1 · 9 citations · h-index: 7 · ICPR
Originality: Incremental advance
AI Analysis

The paper addresses inefficient and inaccurate object localization in 3D scenes, a problem relevant to applications such as robotics and augmented reality, and represents an incremental improvement over existing methods.

This paper tackles 3D visual grounding by proposing a joint top-down and bottom-up framework to locate objects in 3D point clouds based on text descriptions, achieving state-of-the-art performance on the ScanRefer benchmark.

This paper tackles the challenging task of 3D visual grounding: locating a specific object in a 3D point cloud scene based on a text description. Existing methods fall into two categories: top-down and bottom-up. Top-down methods rely on a pre-trained 3D detector to generate candidate bounding boxes and then select the best one, which makes them time-consuming. Bottom-up methods directly regress object bounding boxes from coarse-grained features, which yields less accurate results. To combine their strengths while addressing their limitations, we propose a joint top-down and bottom-up framework that aims to improve both accuracy and efficiency. Specifically, in the first stage, we propose a bottom-up proposal generation module, which uses lightweight neural layers to efficiently regress and cluster several coarse object proposals instead of using a complex 3D detector. Then, in the second stage, we introduce a top-down proposal consolidation module, which uses a graph design to aggregate and propagate query-related object contexts among the generated proposals for further refinement. By jointly training these two modules, we avoid the inherent drawbacks of the complex proposals in the top-down framework and the coarse proposals in the bottom-up framework. Experimental results on the ScanRefer benchmark show that our framework achieves state-of-the-art performance.
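
To make the two-stage idea concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: the function names (`generate_proposals`, `consolidate`), the greedy radius clustering, the distance-weighted score smoothing, and all parameter values are assumptions chosen for illustration. The per-point center offsets and the per-proposal query scores, which a trained network would predict, are passed in as inputs.

```python
# Toy sketch only: every name and number here is hypothetical, not from the paper.
from math import dist, exp


def generate_proposals(points, offsets, radius=0.5):
    """Bottom-up stage (stand-in): shift each point by its regressed
    center offset, then greedily cluster the shifted points."""
    votes = [tuple(p + o for p, o in zip(pt, off))
             for pt, off in zip(points, offsets)]
    clusters = []  # each cluster: {"seed": first vote, "members": point indices}
    for i, v in enumerate(votes):
        for c in clusters:
            if dist(v, c["seed"]) <= radius:
                c["members"].append(i)
                break
        else:  # no existing cluster is close enough: start a new one
            clusters.append({"seed": v, "members": [i]})
    proposals = []
    for c in clusters:
        pts = [points[i] for i in c["members"]]
        lo = tuple(min(p[k] for p in pts) for k in range(3))
        hi = tuple(max(p[k] for p in pts) for k in range(3))
        center = tuple((a + b) / 2 for a, b in zip(lo, hi))
        proposals.append({"center": center, "box": (lo, hi)})
    return proposals


def consolidate(proposals, query_scores, sigma=2.0, mix=0.3):
    """Top-down stage (stand-in): one round of message passing on a fully
    connected proposal graph; closer neighbours contribute more context."""
    refined = []
    for i, p in enumerate(proposals):
        total_w, context = 0.0, 0.0
        for j, q in enumerate(proposals):
            if i == j:
                continue
            w = exp(-dist(p["center"], q["center"]) / sigma)
            total_w += w
            context += w * query_scores[j]
        neighbour = context / total_w if total_w else 0.0
        refined.append((1 - mix) * query_scores[i] + mix * neighbour)
    best = max(range(len(proposals)), key=refined.__getitem__)
    return best, refined


# Four points forming two small objects along the x axis.
points = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (3.0, 0.0, 0.0), (3.2, 0.0, 0.0)]
offsets = [(0.1, 0.0, 0.0), (-0.1, 0.0, 0.0), (0.1, 0.0, 0.0), (-0.1, 0.0, 0.0)]
proposals = generate_proposals(points, offsets)
best, refined = consolidate(proposals, query_scores=[0.2, 0.9])
print(len(proposals), best)
```

In this toy run the four points collapse into two coarse proposals, and the second proposal stays the top match after its score is mixed with its neighbour's. In the framework described above, both stages are learned and trained jointly; the fixed heuristics here only mirror the data flow from coarse proposals to graph-based refinement.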

