CV · RO · May 27

Embodied3DBench: Benchmarking Low-Level Embodied Spatial Intelligence of Vision Language Models

Jiyao Zhang, Mingxu Zhang, Yitong Peng, Haoxuan Liu, Chenshuo Wang, Yuxing Long, Haoyang Huang, Dongjiang Li, Nan Duan, Hui Shen, Hao Dong

arXiv:2605.29074 · 95.3 · h-index: 3

Predicted impact: top 8% in CV · last 90 days

Originality: Incremental advance

AI Analysis

This benchmark fills a critical gap for evaluating and improving interaction-aware spatial intelligence in VLMs for embodied AI applications.

Embodied3DBench is a benchmark of over 21k QA pairs across 6 task categories that evaluates low-level spatial intelligence in VLMs for embodied 3D environments. Results show that models are relatively strong at high-level spatial reasoning but fragile in interaction-oriented perception, and that fine-tuning on a synthesized dataset of 1.3M QA pairs significantly improves low-level spatial intelligence.

Are current Vision Language Models (VLMs) ready to comprehend and reason about complex embodied interactions in 3D environments? We introduce Embodied3DBench, a robot-centric benchmark targeting low-level spatial intelligence in embodied 3D environments. To systematically evaluate these foundational perceptual capabilities, the benchmark includes 6 task categories divided into two core groups: Spatial Structural Understanding (Grounding, Spatial Relation Prediction, and Multi-view Correspondence) and Interaction-Oriented Perception (Affordance Prediction, Grasp Point Prediction, and Trajectory Prediction). The benchmark spans 12 subcategories and contains over 21k high-quality question-answer pairs. We evaluate 13 state-of-the-art models, and the results show that while current models exhibit relatively strong high-level spatial reasoning, such as understanding object-to-object positional relations, they remain fragile in interaction-oriented perception, highlighting a significant lack of robust 3D-aware interaction priors. To actively bridge this capability gap revealed by our benchmark, we further synthesize a large-scale training dataset comprising 1.3M QA pairs. Notably, fine-tuning on this dataset yields significant improvements in low-level spatial intelligence. Ultimately, Embodied3DBench fills a critical gap by providing both a systematic evaluation framework and a scalable data solution, setting a clear target for the development of interaction-aware multimodal systems.
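The two-group task taxonomy described in the abstract can be sketched as a small lookup table, together with a helper that rolls per-question results up to the group level. This is only an illustrative sketch: the `(task, is_correct)` record format, the `group_accuracy` helper, and the use of plain accuracy as the score are assumptions, not the benchmark's actual data format or metrics.

```python
from collections import defaultdict

# Task taxonomy as listed in the abstract: two groups, three tasks each.
TASK_GROUPS = {
    "Spatial Structural Understanding": [
        "Grounding",
        "Spatial Relation Prediction",
        "Multi-view Correspondence",
    ],
    "Interaction-Oriented Perception": [
        "Affordance Prediction",
        "Grasp Point Prediction",
        "Trajectory Prediction",
    ],
}

# Reverse lookup: task name -> group name.
TASK_TO_GROUP = {
    task: group for group, tasks in TASK_GROUPS.items() for task in tasks
}


def group_accuracy(records):
    """Aggregate per-group accuracy from (task, is_correct) pairs.

    Hypothetical helper: the record format and the plain-accuracy score
    are assumptions for illustration; the paper may score tasks differently.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, is_correct in records:
        group = TASK_TO_GROUP[task]
        total[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / total[group] for group in total}
```

Calling `group_accuracy` on a list of per-question records returns one score per group, which makes it easy to compare a model's Spatial Structural Understanding score against its Interaction-Oriented Perception score.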
