CV · May 19

Tango3D: Towards Alignment for Global and Local 2D-3D Correspondence

arXiv:2605.19727 · 27.2
Predicted impact: top 28% in CV · last 90 days
Originality: Incremental advance
AI Analysis

For 3D vision researchers, Tango3D addresses the lack of fine-grained 2D-3D correspondence in existing foundation models, enabling dense downstream tasks.

Tango3D introduces a 3D foundation model that achieves both dense pixel-to-point correspondence and global cross-modal retrieval, a capability not offered by existing models. It maintains competitive global retrieval while enabling fine-grained alignment.

Existing 3D foundation models typically align point clouds to frozen vision-language spaces like CLIP, which achieve strong cross-modal retrieval by compressing 3D shape into a global vector. However, this global-only alignment cannot establish fine-grained pixel-to-point correspondence. To solve this, we present Tango3D, a foundation model that unifies dense correspondence and global retrieval. We use a geometry-aware 2D visual backbone and a pretrained 3D VAE to encode images into 2D patches and point clouds into 3D tokens. These are mapped into a single shared space to achieve both local pixel-to-point alignment and global semantic alignment. To stabilize the joint learning of dense and global objectives, we introduce a three-stage progressive training strategy. Experiments show our model successfully achieves object-level pixel-to-point alignment while maintaining competitive global retrieval, a joint capability not offered by existing 3D foundation models. By establishing a fine-grained alignment feature space, Tango3D injects rich semantics into purely geometric 3D tokens, paving the way for a wide range of dense 3D downstream tasks.
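The idea of a single shared space serving both local and global alignment can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the mean pooling used for the global vector, the cosine-similarity nearest-neighbour matching, and the random stand-in features are all assumptions made for illustration. The abstract does not specify how Tango3D pools features or matches patches to tokens.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def dense_correspondence(patch_feats, token_feats):
    """Hypothetical local alignment: match each 2D patch to its most similar 3D token.

    patch_feats: (num_patches, dim) image patch features in the shared space
    token_feats: (num_tokens, dim) 3D token features in the same space
    Returns the best token index per patch and the full similarity matrix.
    """
    sim = l2_normalize(patch_feats) @ l2_normalize(token_feats).T
    return sim.argmax(axis=1), sim

def global_similarity(patch_feats, token_feats):
    """Hypothetical global alignment: cosine similarity of mean-pooled features."""
    g2d = l2_normalize(patch_feats.mean(axis=0))
    g3d = l2_normalize(token_feats.mean(axis=0))
    return float(g2d @ g3d)

# Toy data (assumed): 6 random 3D tokens, and 4 patches that are noisy copies
# of tokens 3, 0, 5, 1, standing in for well-aligned features.
rng = np.random.default_rng(0)
token_feats = rng.normal(size=(6, 16))
true_match = np.array([3, 0, 5, 1])
patch_feats = token_feats[true_match] + 0.01 * rng.normal(size=(4, 16))

matches, sim = dense_correspondence(patch_feats, token_feats)
score = global_similarity(patch_feats, token_feats)
```

In this toy setup `matches` recovers `true_match`, because each patch is a lightly perturbed copy of its token. The same normalized features yield a single global score, which is the sense in which one embedding space can support both dense correspondence and retrieval.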

