CV | Mar 14, 2025

Open3D-VQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space

arXiv:2503.11094v4 | 15 citations | h-index: 7 | Has Code | MM
Originality: Incremental advance
AI Analysis

This work addresses a gap in benchmarking spatial reasoning for MLLMs in aerial contexts, providing a tool for researchers, but it is incremental as it builds on existing evaluation frameworks.

The authors evaluate spatial reasoning in multimodal large language models (MLLMs) in open aerial environments by introducing Open3D-VQA, a benchmark with 73k QA pairs across 7 tasks. They find that models answer questions about relative spatial relations better than questions about absolute distances, that 3D LLMs show no significant advantage over 2D LLMs, and that fine-tuning on simulated data improves real-world performance.

Spatial reasoning is a fundamental capability of multimodal large language models (MLLMs), yet their performance in open aerial environments remains underexplored. In this work, we present Open3D-VQA, a novel benchmark for evaluating MLLMs' ability to reason about complex spatial relationships from an aerial perspective. The benchmark comprises 73k QA pairs spanning 7 general spatial reasoning tasks, including multiple-choice, true/false, and short-answer formats, and supports both visual and point cloud modalities. The questions are automatically generated from spatial relations extracted from both real-world and simulated aerial scenes. Evaluation on 13 popular MLLMs reveals that: 1) Models are generally better at answering questions about relative spatial relations than absolute distances, 2) 3D LLMs fail to demonstrate significant advantages over 2D LLMs, and 3) Fine-tuning solely on the simulated dataset can significantly improve the model's spatial reasoning performance in real-world scenarios. We release our benchmark, data generation pipeline, and evaluation toolkit to support further research: https://github.com/EmbodiedCity/Open3D-VQA.code.
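To make the distinction between relative-relation and absolute-distance questions concrete, here is a minimal Python sketch of how such QA pairs could be templated from object positions. It is not the authors' data generation pipeline: the `SceneObject` class, the question templates, the metre-based coordinate convention, and the 25% tolerance in `score_absolute` are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class SceneObject:
    """Hypothetical scene object: a label plus a 3D centre point."""
    name: str
    center: Tuple[float, float, float]  # (x, y, z), assumed to be in metres


def relative_qa(a: SceneObject, b: SceneObject) -> Dict[str, str]:
    """True/false question about a relative relation (which object is higher)."""
    answer = "yes" if a.center[2] > b.center[2] else "no"
    return {
        "format": "true/false",
        "question": f"Is the {a.name} higher than the {b.name}?",
        "answer": answer,
    }


def absolute_qa(a: SceneObject, b: SceneObject) -> Dict[str, str]:
    """Short-answer question about the absolute distance between two objects."""
    d = math.dist(a.center, b.center)  # Euclidean distance between centres
    return {
        "format": "short-answer",
        "question": f"How far is the {a.name} from the {b.name}, in metres?",
        "answer": f"{d:.1f}",
    }


def score_absolute(pred: float, truth: float, tol: float = 0.25) -> bool:
    """Count a distance estimate as correct if its relative error is within tol.

    The tolerance value is an illustrative assumption, not the benchmark's metric.
    """
    return abs(pred - truth) <= tol * truth


tower = SceneObject("tower", (0.0, 0.0, 12.0))
car = SceneObject("car", (3.0, 4.0, 0.0))
rel = relative_qa(tower, car)
dist_q = absolute_qa(tower, car)
```

A relative question only requires the model to get an ordering right, whereas an absolute question requires a metric estimate that falls within some tolerance, which is one plausible reason the former would be easier.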

Code Implementations: 2 repos
