From Thousands to Billions: 3D Visual Language Grounding via Render-Supervised Distillation from 2D VLMs
This work addresses the data scarcity problem facing researchers and practitioners in 3D vision-language tasks by offering a method that transfers supervision from 2D models to 3D models. The contribution is incremental in that it builds on existing 2D foundation models.
The paper tackles the data bottleneck in 3D vision-language grounding by introducing LIFT-GS, a distillation technique that uses differentiable rendering to leverage 2D supervision from foundation models, achieving state-of-the-art results such as 25.7% mAP on open-vocabulary instance segmentation and 10-30% improvements on referential grounding tasks.
3D vision-language grounding faces a fundamental data bottleneck: while 2D models train on billions of images, 3D models have access to only thousands of labeled scenes, a six-order-of-magnitude gap that severely limits performance. We introduce $\textbf{LIFT-GS}$, a practical distillation technique that overcomes this limitation by using differentiable rendering to bridge 3D and 2D supervision. LIFT-GS predicts 3D Gaussian representations from point clouds and uses them to render predicted language-conditioned 3D masks into 2D views, enabling supervision from 2D foundation models (SAM, CLIP, LLaMA) without requiring any 3D annotations. This render-supervised formulation enables end-to-end training of complete encoder-decoder architectures and is inherently model-agnostic. LIFT-GS achieves state-of-the-art results with $25.7\%$ mAP on open-vocabulary instance segmentation (vs. $20.2\%$ prior SOTA) and consistent $10$-$30\%$ improvements on referential grounding tasks. Remarkably, pretraining effectively doubles the fine-tuning data ($2\times$), demonstrating strong scaling properties that suggest 3D VLG currently operates in a severely data-scarce regime. Project page: https://liftgs.github.io
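To make the render-supervised idea concrete, the following is a minimal sketch of how a 3D mask prediction could be scored against a 2D pseudo-label. It is a toy under stated assumptions, not the LIFT-GS implementation: it composites per-Gaussian mask values along a single pixel ray with standard front-to-back alpha blending instead of running a full Gaussian splatting rasterizer, it uses binary cross-entropy as the 2D loss (the abstract does not name the loss), and the function names `composite_mask`, `bce`, and `render_supervised_loss` are hypothetical.

```python
import math

def composite_mask(alphas, mask_probs):
    """Blend per-Gaussian mask values along one pixel ray, front to back.

    alphas: opacity of each Gaussian at this pixel, sorted near to far, in [0, 1].
    mask_probs: predicted mask probability of each Gaussian, in [0, 1].
    Both inputs are illustrative stand-ins for what a rasterizer would supply.
    """
    transmittance = 1.0  # fraction of the ray not yet absorbed
    rendered = 0.0
    for alpha, m in zip(alphas, mask_probs):
        rendered += transmittance * alpha * m
        transmittance *= 1.0 - alpha
    return rendered

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy for one pixel (assumed loss, chosen for illustration)."""
    pred = min(max(pred, eps), 1.0 - eps)  # clamp to keep log() finite
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))

def render_supervised_loss(pixels, pseudo_mask):
    """Mean 2D loss between rendered mask values and a 2D pseudo-label.

    pixels: list of (alphas, mask_probs) pairs, one pair per pixel.
    pseudo_mask: list of 0/1 targets, e.g. produced by a 2D segmentation model.
    """
    losses = [bce(composite_mask(a, m), t) for (a, m), t in zip(pixels, pseudo_mask)]
    return sum(losses) / len(losses)

if __name__ == "__main__":
    # Two pixels, one Gaussian each; the 2D pseudo-label marks only the first pixel.
    target = [1.0, 0.0]
    agrees = [([0.9], [1.0]), ([0.9], [0.0])]
    disagrees = [([0.9], [0.0]), ([0.9], [1.0])]
    print(render_supervised_loss(agrees, target) < render_supervised_loss(disagrees, target))
```

Because every step is a sum or product of the predicted quantities, the same computation written in an autodiff framework would pass gradients from the 2D loss back to the per-Gaussian predictions, which is what allows 2D pseudo-labels to supervise a 3D model without 3D annotations.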