CV · AI · Mar 19

MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model

arXiv:2603.1889271.21 citations · h-index: 5
AI Analysis

The paper addresses the lack of adequate spatial reasoning benchmarks for Vision-Language Models, particularly for real-world deployment. The contribution is incremental, since it builds on existing benchmark efforts.

The paper targets the lack of benchmarks for multi-hop compositional spatial reasoning in Vision-Language Models. It introduces MultihopSpatial, a benchmark of 1- to 3-hop queries, together with a new metric, Acc@50IoU. Its evaluation shows that current VLMs struggle with these queries and that reinforcement learning post-training improves performance.

Spatial reasoning is foundational for Vision-Language Models (VLMs), particularly when deployed as Vision-Language-Action (VLA) agents in physical environments. However, existing benchmarks predominantly focus on elementary, single-hop relations, neglecting the multi-hop compositional reasoning and precise visual grounding essential for real-world scenarios. To address this, we introduce MultihopSpatial, offering three key contributions: (1) A comprehensive benchmark designed for multi-hop and compositional spatial reasoning, featuring 1- to 3-hop complex queries across diverse spatial perspectives. (2) Acc@50IoU, a complementary metric that simultaneously evaluates reasoning and visual grounding by requiring both answer selection and precise bounding box prediction - capabilities vital for robust VLA deployment. (3) MultihopSpatial-Train, a dedicated large-scale training corpus to foster spatial intelligence. Extensive evaluation of 37 state-of-the-art VLMs yields eight key insights, revealing that compositional spatial reasoning remains a formidable challenge. Finally, we demonstrate that reinforcement learning post-training on our corpus enhances both intrinsic VLM spatial reasoning and downstream embodied manipulation performance.
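The abstract describes Acc@50IoU only in words, so the following is a minimal sketch of how such a metric could be computed, not the authors' implementation. It assumes a sample counts as correct only when the selected answer matches the reference and the predicted bounding box overlaps the reference box with an IoU of at least 0.5. The box layout `(x_min, y_min, x_max, y_max)`, the inclusive threshold, and the function names `iou` and `acc_at_50_iou` are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_50_iou(
    pred_answers: Sequence[str],
    pred_boxes: Sequence[Box],
    gold_answers: Sequence[str],
    gold_boxes: Sequence[Box],
    threshold: float = 0.5,  # assumed inclusive cutoff
) -> float:
    """Fraction of samples with a correct answer AND a box IoU >= threshold."""
    if not gold_answers:
        return 0.0
    hits = sum(
        1
        for pa, pb, ga, gb in zip(pred_answers, pred_boxes, gold_answers, gold_boxes)
        if pa == ga and iou(pb, gb) >= threshold
    )
    return hits / len(gold_answers)


if __name__ == "__main__":
    gold_answers = ["A", "B", "D"]
    gold_boxes = [(0, 0, 2, 2), (0, 0, 2, 2), (0, 0, 2, 2)]
    pred_answers = ["A", "B", "C"]
    # Sample 1: right answer, exact box. Sample 2: right answer, poor box.
    # Sample 3: exact box, wrong answer.
    pred_boxes = [(0, 0, 2, 2), (1, 1, 3, 3), (0, 0, 2, 2)]
    print(acc_at_50_iou(pred_answers, pred_boxes, gold_answers, gold_boxes))
```

In this toy run only the first sample satisfies both conditions, which illustrates why a joint metric of this kind can be much stricter than answer accuracy alone: here plain answer accuracy would be 2/3.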

