CV | Mar 5, 2024

MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding

arXiv:2403.03077v4 | 33 citations | h-index: 24 | CVPR
Originality: Highly original
AI Analysis

This work addresses accuracy and interpretability issues in 3D visual grounding for applications such as robotics and AR/VR. It represents a strong, specific gain rather than a foundational breakthrough.

The paper tackles 3D visual grounding, where existing methods struggle with object recognition accuracy and with complex linguistic queries. It proposes the MiKASA Transformer, which achieves the highest overall accuracy in the Referit3D challenge on both the Sr3D and Nr3D datasets, with large improvements in viewpoint-dependent categories.

3D visual grounding involves matching natural language descriptions with their corresponding objects in 3D spaces. Existing methods often face challenges with accuracy in object recognition and struggle in interpreting complex linguistic queries, particularly with descriptions that involve multiple anchors or are view-dependent. In response, we present the MiKASA (Multi-Key-Anchor Scene-Aware) Transformer. Our novel end-to-end trained model integrates a self-attention-based scene-aware object encoder and an original multi-key-anchor technique, enhancing object recognition accuracy and the understanding of spatial relationships. Furthermore, MiKASA improves the explainability of decision-making, facilitating error diagnosis. Our model achieves the highest overall accuracy in the Referit3D challenge for both the Sr3D and Nr3D datasets, particularly excelling by a large margin in categories that require viewpoint-dependent descriptions.

Code Implementations: 1 repo
