CV · May 3, 2025

3DWG: 3D Weakly Supervised Visual Grounding via Category and Instance-Level Alignment

arXiv:2505.01809v13 citations · h-index: 10 · ICRA
Originality: Incremental advance
AI Analysis

This work addresses 3D visual grounding, a task relevant to robotics and AR/VR applications. It is an incremental advance: it builds on existing weakly-supervised methods and adds targeted improvements.

The paper tackles localizing 3D objects in point clouds from natural language descriptions without direct annotations. It targets two challenges, category-level ambiguity and instance-level complexity, and reports state-of-the-art performance on the Nr3D, Sr3D, and ScanRef benchmarks.

The 3D weakly-supervised visual grounding task aims to localize oriented 3D boxes in point clouds based on natural language descriptions without requiring annotations to guide model learning. This setting presents two primary challenges: category-level ambiguity and instance-level complexity. Category-level ambiguity arises from representing objects of fine-grained categories in a highly sparse point cloud format, making category distinction challenging. Instance-level complexity stems from multiple instances of the same category coexisting in a scene, leading to distractions during grounding. To address these challenges, we propose a novel weakly-supervised grounding approach that explicitly differentiates between categories and instances. In the category-level branch, we utilize extensive category knowledge from a pre-trained external detector to align object proposal features with sentence-level category features, thereby enhancing category awareness. In the instance-level branch, we utilize spatial relationship descriptions from language queries to refine object proposal features, ensuring clear differentiation among objects. These designs enable our model to accurately identify target-category objects while distinguishing instances within the same category. Compared to previous methods, our approach achieves state-of-the-art performance on three widely used benchmarks: Nr3D, Sr3D, and ScanRef.
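The two-branch idea in the abstract can be illustrated with a toy selection rule: first keep proposals whose features match the queried category, then rank the survivors by an instance-level score. This is only a minimal sketch under stated assumptions, not the paper's implementation. The function `ground`, the cosine-similarity filter, the `category_threshold` parameter, and the hand-written feature vectors and scores are all hypothetical stand-ins for learned components.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ground(proposal_feats, category_feat, instance_scores, category_threshold=0.5):
    """Return the index of the selected proposal.

    Hypothetical two-step rule (an assumption, not the paper's method):
      1. category step: keep proposals whose feature is similar to the
         sentence-level category feature;
      2. instance step: among those, pick the highest instance-level score
         (e.g. how well a proposal fits the spatial relation in the query).
    """
    cat_sims = [cosine(p, category_feat) for p in proposal_feats]
    candidates = [i for i, s in enumerate(cat_sims) if s >= category_threshold]
    if not candidates:
        # Nothing matched the category: fall back to ranking all proposals.
        candidates = list(range(len(proposal_feats)))
    return max(candidates, key=lambda i: instance_scores[i])


# Toy scene: two chairs and one table, query category "chair".
proposal_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # chair A, chair B, table
category_feat = [1.0, 0.0]                             # made-up "chair" text feature
instance_scores = [0.2, 0.7, 0.9]                      # made-up spatial-relation scores

target = ground(proposal_feats, category_feat, instance_scores)
print(target)
```

In this toy scene the table has the highest instance-level score but is removed by the category step, so the choice is made between the two chairs only. That mirrors the abstract's distinction between identifying target-category objects and then separating instances within that category.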

