CV · May 20

Bridging Structure and Language: Graph-Based Visual Reasoning for Autonomous Road Understanding

arXiv:2605.20942 · 72.1

Predicted impact: top 40% in CV · last 90 days
Originality: Incremental advance

AI Analysis

For autonomous driving researchers, this work provides a method to improve VLMs' structured road reasoning without large-scale data or models, though the advance is incremental, since it combines existing graph and language approaches.

The paper introduces the Combined Road Substrate (CRS), a graph-grounded framework that integrates geometric road structure with open-vocabulary semantics for autonomous driving. Training small 2B- or 4B-parameter models on as few as 20 to 80 CRS-enriched scenes yields stable gains in compositional reasoning, suggesting that the primary bottleneck is the absence of structured supervision, not model scale.

Structured road understanding of lane geometry, topology, and traffic element relationships is foundational to safe autonomous driving. While vision-language models (VLMs) offer promising semantic flexibility, they lack the geometric and relational grounding required for precise road reasoning. Conversely, traditional modular systems, such as HD maps and topological road graphs, provide structural precision but remain semantically rigid. To bridge this gap, we introduce the Combined Road Substrate (CRS), a graph-grounded framework that makes geometric road structure and open-vocabulary semantics jointly executable in a single representation. CRS enables the automatic generation of compositionally complex and linguistically varied question-answer pairs via recursive graph queries, augmented with a "grounding for free" mechanism that ensures logical traceability to specific map elements, and with procedurally extracted chain-of-thought supervision traces. We demonstrate that state-of-the-art VLMs, including large, closed-source models, struggle significantly with structured road reasoning, yet training a small 2- or 4-billion-parameter model with as few as 20 to 80 CRS-enriched scenes yields stable gains in compositional reasoning tasks of varying depth. Analysis of model behavior via verifiable reasoning traces reveals a systematic shift in failure modes: whereas baseline models fail at relational scene understanding, the remaining failures of CRS-trained models reduce to attribute recognition, suggesting that the primary bottleneck in road understanding is not model scale, but the absence of structured supervision.
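The abstract does not say how CRS is implemented. As a purely illustrative sketch of the general idea it describes (a recursive graph query that yields a question-answer pair, the map elements that ground the answer, and a step-by-step trace), a toy version might look like the following. Every name here (`RoadGraph`, `follow`, `make_qa`, the relations `successor` and `controlled_by`, and the node ids) is a hypothetical choice for this example and is not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A map element (lane or traffic element) with free-form attributes."""
    node_id: str
    kind: str
    attrs: dict = field(default_factory=dict)


class RoadGraph:
    """Toy directed graph (hypothetical): nodes are map elements, edges are typed relations."""

    def __init__(self):
        self.nodes = {}
        self.edges = {}  # (source_id, relation) -> list of target ids

    def add_node(self, node_id, kind, **attrs):
        self.nodes[node_id] = Node(node_id, kind, attrs)

    def add_edge(self, source_id, relation, target_id):
        self.edges.setdefault((source_id, relation), []).append(target_id)

    def follow(self, node_ids, relations, trace):
        """Recursively follow a chain of relations, logging every hop in `trace`."""
        if not relations:
            return sorted(node_ids)
        head, rest = relations[0], relations[1:]
        reached = set()
        for nid in sorted(node_ids):
            targets = self.edges.get((nid, head), [])
            trace.append(f"{nid} --{head}--> {targets}")
            reached.update(targets)
        return self.follow(reached, rest, trace)


def make_qa(graph, start_id, relations, attribute):
    """Build one QA pair plus the grounding ids and the reasoning trace."""
    trace = []
    grounded = graph.follow({start_id}, relations, trace)
    answer = [graph.nodes[nid].attrs.get(attribute) for nid in grounded]
    path = ", then ".join(relations)
    question = (
        f"Starting from {start_id}, follow {path}. "
        f"What is the {attribute} of the element reached?"
    )
    return {"question": question, "answer": answer,
            "grounding": grounded, "trace": trace}


# Assumed example scene: one lane leads into another, which a traffic light controls.
g = RoadGraph()
g.add_node("lane_1", "lane", direction="straight")
g.add_node("lane_2", "lane", direction="left_turn")
g.add_node("light_7", "traffic_light", state="red")
g.add_edge("lane_1", "successor", "lane_2")
g.add_edge("lane_2", "controlled_by", "light_7")

qa = make_qa(g, "lane_1", ["successor", "controlled_by"], "state")
print(qa["question"])
print(qa["answer"], qa["grounding"])
for step in qa["trace"]:
    print(step)
```

In this toy version, the depth of the question equals the length of the relation chain, and the grounding and the trace come at no extra cost because they are by-products of executing the query. That is one plausible reading of the "grounding for free" idea, not a description of the authors' mechanism.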
