Language-to-Space Programming for Training-Free 3D Visual Grounding
This work addresses two obstacles in 3D visual grounding: the high annotation cost of supervised methods and the inefficiency of existing training-free methods. It offers a balance of accuracy and efficiency suited to applications such as robotics and AR/VR.
The paper tackles training-free 3D visual grounding by introducing Language-to-Space Programming (LaSP), which uses LLM-generated code to analyze spatial relations. LaSP achieves 52.9% accuracy on the Nr3D benchmark while reducing grounding time and token costs.
3D visual grounding (3DVG) is challenging due to the need to understand 3D spatial relations. While supervised approaches have achieved superior performance, they are constrained by the scarcity and high annotation costs of 3D vision-language datasets. Training-free approaches based on LLMs/VLMs eliminate the need for large-scale training data, but they either incur prohibitive grounding time and token costs or deliver unsatisfactory accuracy. To address these challenges, we introduce a novel method for training-free 3D visual grounding, namely Language-to-Space Programming (LaSP). LaSP uses LLM-generated code to analyze 3D spatial relations among objects, along with a pipeline that evaluates and optimizes the code automatically. Experimental results demonstrate that LaSP achieves 52.9% accuracy on the Nr3D benchmark, ranking among the best training-free methods. Moreover, it substantially reduces grounding time and token costs, offering a balanced trade-off between performance and efficiency.
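To make the idea of code that analyzes spatial relations concrete, here is a minimal sketch in Python. It is not the paper's implementation: the `Box3D` type, the `near` and `far` scoring functions, and the `ground`, `accuracy`, and `select_best` helpers are all hypothetical names chosen for illustration. The selection step only assumes that candidate relation functions can be scored against a few labeled examples, and may differ from LaSP's actual pipeline.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Box3D:
    """Hypothetical object record: class label plus an axis-aligned 3D box."""
    label: str
    center: Tuple[float, float, float]
    size: Tuple[float, float, float]


# A relation maps (target, anchor) to a score; higher means "relation holds more".
Relation = Callable[[Box3D, Box3D], float]
# A labeled example: (scene objects, target label, anchor label, expected object).
Example = Tuple[List[Box3D], str, str, Box3D]


def near(target: Box3D, anchor: Box3D) -> float:
    """Illustrative 'near' relation: score decays with center distance."""
    return 1.0 / (1.0 + math.dist(target.center, anchor.center))


def far(target: Box3D, anchor: Box3D) -> float:
    """Illustrative 'far' relation: score grows with center distance."""
    return math.dist(target.center, anchor.center)


def ground(objects: Sequence[Box3D], target_label: str,
           anchor_label: str, relation: Relation) -> Optional[Box3D]:
    """Pick the target-class object whose best score against any anchor is highest."""
    candidates = [o for o in objects if o.label == target_label]
    anchors = [o for o in objects if o.label == anchor_label]
    if not candidates or not anchors:
        return None
    return max(candidates, key=lambda c: max(relation(c, a) for a in anchors))


def accuracy(relation: Relation, examples: Sequence[Example]) -> float:
    """Fraction of labeled examples on which the relation grounds the expected object."""
    if not examples:
        return 0.0
    hits = sum(ground(objs, t, a, relation) == expected
               for objs, t, a, expected in examples)
    return hits / len(examples)


def select_best(candidates: Sequence[Relation],
                examples: Sequence[Example]) -> Relation:
    """Keep the candidate relation function with the highest example accuracy."""
    return max(candidates, key=lambda r: accuracy(r, examples))


# Usage: "the chair near the table" in a toy scene with two chairs.
chair_a = Box3D("chair", (0.0, 0.0, 0.0), (0.5, 0.5, 1.0))
chair_b = Box3D("chair", (5.0, 0.0, 0.0), (0.5, 0.5, 1.0))
table = Box3D("table", (1.0, 0.0, 0.0), (1.0, 1.0, 0.8))
scene = [chair_a, chair_b, table]

examples = [(scene, "chair", "table", chair_a)]
best = select_best([far, near], examples)   # candidates stand in for LLM-written code
picked = ground(scene, "chair", "table", best)
```

In this toy setup the two hand-written candidates stand in for LLM-generated functions. Scoring them against labeled examples and keeping the best one is one simple way to evaluate generated code automatically, and grounding a query then costs only a function call rather than an LLM query per object pair.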