CV · AI · May 18

SPATIOROUTE: Dynamic Prompt Routing for Zero-Shot Spatial Reasoning

arXiv:2605.18209 · 70.6
Predicted impact: top 42% in CV · last 90 days
Originality: Incremental advance
AI Analysis

For researchers in spatial video understanding, this work provides a training-free method to improve VLM performance on zero-shot spatial reasoning, though gains are modest (up to 5%).

SpatioRoute introduces a dynamic prompt routing method for zero-shot spatial question answering over egocentric video. It achieves overall accuracy gains of up to 5% over fixed prompts and sets a new state of the art on SQA3D among zero-shot, video-only methods that use no 3D point-cloud inputs.

Spatial question answering over egocentric video is a challenging task that requires Vision-Language Models (VLMs) to reason about 3D object positions, scene affordances, and directional relationships, particularly in the zero-shot setting where no task-specific fine-tuning is available. We introduce SpatioRoute, a dynamic prompt generation approach that routes each incoming question to a semantically tailored prompt template -- without any additional training, fine-tuning, or 3D sensor input. SpatioRoute operates in two complementary modes: SpatioRoute-R, a rule-based router that deterministically maps question typologies (e.g., What, Is, How, Can, Which) to specialized prompt templates; and SpatioRoute-L, an LLM-driven approach that generates task-specific prompts from the question and situational context alone, with no video input at routing time. We evaluate SpatioRoute on the SQA3D benchmark across VLMs spanning model families. SpatioRoute achieves consistent overall accuracy gains up to 5% over fixed prompt baselines, establishing a new state-of-the-art for zero-shot video-only spatial VQA without requiring 3D point-cloud inputs. As an additional finding, we observe that Chain-of-Thought (CoT) prompting, implemented via the Think it Twice architecture, consistently degrades performance in this setting on Qwen series models, confirming that question-aware routing is more effective than uniform reasoning instructions for spatial video understanding.
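The rule-based mode described in the abstract (SpatioRoute-R) can be pictured as a lookup from the question's leading word to a prompt template. The following is a minimal sketch under stated assumptions, not the authors' implementation: the template wording, the `other` fallback, and the names `TEMPLATES`, `DEFAULT_TEMPLATE`, `question_type`, and `route` are all hypothetical, and the abstract does not say how question types are detected.

```python
import re

# Hypothetical templates keyed by question type. The wording is illustrative
# only; the paper's actual templates are not given in the abstract.
TEMPLATES = {
    "what": "Situation: {situation} Name the object or attribute asked about. Question: {question}",
    "is": "Situation: {situation} Check the described state and answer yes or no. Question: {question}",
    "how": "Situation: {situation} Answer with a count or a short description. Question: {question}",
    "can": "Situation: {situation} Decide whether the action is possible from this position. Question: {question}",
    "which": "Situation: {situation} Pick one option, such as a direction or an object. Question: {question}",
}

# Assumed fallback for question types the router has no rule for.
DEFAULT_TEMPLATE = "Situation: {situation} Question: {question}"


def question_type(question: str) -> str:
    """Classify a question by its first word (an assumed, simplistic rule)."""
    match = re.match(r"\s*([A-Za-z]+)", question)
    first_word = match.group(1).lower() if match else ""
    return first_word if first_word in TEMPLATES else "other"


def route(question: str, situation: str) -> str:
    """Deterministically map a question to a filled-in prompt template."""
    template = TEMPLATES.get(question_type(question), DEFAULT_TEMPLATE)
    return template.format(situation=situation, question=question)


if __name__ == "__main__":
    print(route("Can I reach the lamp without standing up?", "I am sitting on the sofa."))
```

Because the mapping depends only on the question text, the same question always yields the same prompt, and no video frames are needed at routing time. The filled-in prompt would then be passed to the VLM together with the video.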

