CV | Jan 9

SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving

arXiv:2601.05640v2 | 12 citations | h-index: 15
Originality: Incremental advance
AI Analysis

This work addresses safe trajectory planning in autonomous driving by strengthening VLM capabilities. It is an incremental improvement that applies hierarchical structuring to existing models.

The paper tackles the problem of adapting generalist Vision-Language Models (VLMs) to autonomous driving by proposing SGDrive, a framework that structures VLM representation learning into a scene-agent-goal hierarchy, achieving state-of-the-art performance on the NAVSIM benchmark among camera-only methods.

Recent end-to-end autonomous driving approaches have leveraged Vision-Language Models (VLMs) to enhance planning capabilities in complex driving scenarios. However, VLMs are inherently trained as generalist models, lacking specialized understanding of driving-specific reasoning in 3D space and time. When applied to autonomous driving, these models struggle to establish structured spatial-temporal representations that capture geometric relationships, scene context, and motion patterns critical for safe trajectory planning. To address these limitations, we propose SGDrive, a novel framework that explicitly structures the VLM's representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition: drivers first perceive the overall environment (scene context), then attend to safety-critical agents and their behaviors, and finally formulate short-term goals before executing actions. This hierarchical decomposition provides the structured spatial-temporal representation that generalist VLMs lack, integrating multi-level information into a compact yet comprehensive format for trajectory planning. Extensive experiments on the NAVSIM benchmark demonstrate that SGDrive achieves state-of-the-art performance among camera-only methods on both PDMS and EPDMS, validating the effectiveness of hierarchical knowledge structuring for adapting generalist VLMs to autonomous driving.
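The abstract reports results on PDMS without defining it. As a rough illustration, the sketch below follows the PDM Score as commonly described for the NAVSIM v1 benchmark: two penalty terms (no at-fault collision, drivable area compliance) multiply a weighted average of ego progress, time-to-collision, and comfort, with weights 5, 5, and 2. The function name and parameter names are hypothetical and are not NAVSIM's API; EPDMS, which extends the score with further sub-metrics, is not covered here.

```python
def pdm_score(nc: float, dac: float, ep: float, ttc: float, comfort: float) -> float:
    """Illustrative PDM Score for one scenario (assumed NAVSIM v1 formulation).

    nc      -- no at-fault collision penalty, in [0, 1]
    dac     -- drivable area compliance penalty, in [0, 1]
    ep      -- ego progress sub-score, in [0, 1]
    ttc     -- time-to-collision sub-score, in [0, 1]
    comfort -- comfort sub-score, in [0, 1]
    """
    inputs = {"nc": nc, "dac": dac, "ep": ep, "ttc": ttc, "comfort": comfort}
    for name, value in inputs.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")

    # Weighted average of the soft sub-scores (assumed weights 5, 5, 2).
    weighted = (5.0 * ep + 5.0 * ttc + 2.0 * comfort) / 12.0

    # The penalty terms act multiplicatively, so a zero on either one
    # zeroes the whole score regardless of the other sub-scores.
    return nc * dac * weighted
```

Because the penalties are multiplicative, a planner that makes good progress but leaves the drivable area still scores zero on that scenario, which is why the metric is usually read as safety-first.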