CV · Mar 14, 2025

Open3D-VQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space

Weichen Zhang, Zile Zhou, Xin Zeng, Xuchen Liu, Jianjie Fang, Chen Gao, Yong Li, Jinqiang Cui, Xinlei Chen, Xiao-Ping Zhang

arXiv:2503.11094v428.915 citations · h-index: 7 · Has Code · MM

Originality: Incremental advance

AI Analysis

This work addresses a gap in benchmarking spatial reasoning for MLLMs in aerial contexts, providing a tool for researchers, but it is incremental as it builds on existing evaluation frameworks.

The authors evaluate spatial reasoning in multimodal large language models (MLLMs) in open aerial environments by introducing Open3D-VQA, a benchmark of 73k QA pairs across 7 tasks. They find that models answer questions about relative spatial relations better than questions about absolute distances, that 3D LLMs show no significant advantage over 2D LLMs, and that fine-tuning on simulated data alone improves real-world performance.

Spatial reasoning is a fundamental capability of multimodal large language models (MLLMs), yet their performance in open aerial environments remains underexplored. In this work, we present Open3D-VQA, a novel benchmark for evaluating MLLMs' ability to reason about complex spatial relationships from an aerial perspective. The benchmark comprises 73k QA pairs spanning 7 general spatial reasoning tasks, including multiple-choice, true/false, and short-answer formats, and supports both visual and point cloud modalities. The questions are automatically generated from spatial relations extracted from both real-world and simulated aerial scenes. Evaluation on 13 popular MLLMs reveals that: 1) Models are generally better at answering questions about relative spatial relations than absolute distances, 2) 3D LLMs fail to demonstrate significant advantages over 2D LLMs, and 3) Fine-tuning solely on the simulated dataset can significantly improve the model's spatial reasoning performance in real-world scenarios. We release our benchmark, data generation pipeline, and evaluation toolkit to support further research: https://github.com/EmbodiedCity/Open3D-VQA.code.
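To make the distinction between relative and absolute questions concrete, here is a minimal Python sketch of template-based QA generation from object positions. It is not the authors' pipeline, which the abstract does not describe in detail. The `SceneObject` type, the question templates, the camera-centred coordinate frame, and the example coordinates are all assumptions made up for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical object record: a label plus a 3D position in meters."""
    name: str
    position: Point  # assumed camera-centred frame, camera at the origin


def make_absolute_distance_qa(a: SceneObject, b: SceneObject) -> Dict[str, str]:
    """Short-answer question asking for a metric distance between two objects."""
    d = math.dist(a.position, b.position)
    return {
        "format": "short-answer",
        "question": f"How far apart are the {a.name} and the {b.name}, in meters?",
        "answer": f"{d:.1f}",
    }


def make_relative_qa(a: SceneObject, b: SceneObject,
                     camera: Point = (0.0, 0.0, 0.0)) -> Dict[str, str]:
    """True/false question comparing two objects' distances to the camera."""
    da = math.dist(a.position, camera)
    db = math.dist(b.position, camera)
    return {
        "format": "true/false",
        "question": f"Is the {a.name} closer to the camera than the {b.name}?",
        "answer": "true" if da < db else "false",
    }


# Illustrative coordinates only; not taken from the benchmark.
building = SceneObject("building", (3.0, 4.0, 0.0))
tree = SceneObject("tree", (6.0, 8.0, 0.0))

absolute_qa = make_absolute_distance_qa(building, tree)
relative_qa = make_relative_qa(building, tree)
print(absolute_qa)
print(relative_qa)
```

In this sketch the relative question only requires ordering two distances, while the absolute question requires a metric estimate, which is one plausible reading of why the first kind is reported as easier for models.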
