CV | May 22, 2025

SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving

arXiv:2505.16805v1 | 22 citations | h-index: 26 | CVPR
Originality: Incremental advance
AI Analysis

This work addresses efficient and interpretable decision-making for autonomous vehicles, representing an incremental advancement in combining VLMs with end-to-end models.

The paper tackles the integration of Vision-Language Models (VLMs) with end-to-end models for autonomous driving planning. It introduces SOLVE, which uses a shared visual encoder and a Trajectory Chain-of-Thought to refine trajectory predictions, and reports significant accuracy improvements on the nuScenes dataset.

The integration of Vision-Language Models (VLMs) into autonomous driving systems has shown promise in addressing key challenges such as learning complexity, interpretability, and common-sense reasoning. However, existing approaches often struggle with efficient integration and real-time decision-making due to computational demands. In this paper, we introduce SOLVE, an innovative framework that synergizes VLMs with end-to-end (E2E) models to enhance autonomous vehicle planning. Our approach emphasizes knowledge sharing at the feature level through a shared visual encoder, enabling comprehensive interaction between VLM and E2E components. We propose a Trajectory Chain-of-Thought (T-CoT) paradigm, which progressively refines trajectory predictions, reducing uncertainty and improving accuracy. By employing a temporal decoupling strategy, SOLVE achieves efficient cooperation by aligning high-quality VLM outputs with E2E real-time performance. Evaluated on the nuScenes dataset, our method demonstrates significant improvements in trajectory prediction accuracy, paving the way for more robust and reliable autonomous driving systems.
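The temporal decoupling idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`shared_encoder`, `slow_vlm_plan`, `fast_e2e_plan`, `run_decoupled`), the averaging rule, and the `vlm_period` and `prior_weight` parameters are all hypothetical, chosen only to show one plausible reading in which a slow planner's output from an earlier frame is reused as a prior by a fast planner that runs on every frame, with both planners consuming the same encoded features.

```python
from typing import List, Optional, Tuple

Waypoint = Tuple[float, float]  # (forward x, lateral y); units are illustrative
Trajectory = List[Waypoint]


def shared_encoder(frame: List[float]) -> List[float]:
    """Hypothetical stand-in for a shared visual encoder: one feature per input value."""
    return [0.5 * value for value in frame]


def slow_vlm_plan(features: List[float], horizon: int = 3) -> Trajectory:
    """Toy 'VLM' planner: assumed to be accurate but too slow to run on every frame."""
    speed = sum(features) / len(features)
    return [(speed * (t + 1), 0.0) for t in range(horizon)]


def fast_e2e_plan(
    features: List[float],
    prior: Optional[Trajectory],
    horizon: int = 3,
    prior_weight: float = 0.5,
) -> Trajectory:
    """Toy 'E2E' planner: cheap per-frame estimate, blended with a prior when one exists."""
    speed = 0.8 * sum(features) / len(features)  # deliberately biased estimate
    own = [(speed * (t + 1), 0.0) for t in range(horizon)]
    if prior is None:
        return own
    w = prior_weight
    return [
        ((1 - w) * ox + w * px, (1 - w) * oy + w * py)
        for (ox, oy), (px, py) in zip(own, prior)
    ]


def run_decoupled(frames: List[List[float]], vlm_period: int = 2) -> List[Trajectory]:
    """Run the fast planner every frame; refresh the slow planner's prior every vlm_period frames."""
    plans: List[Trajectory] = []
    prior: Optional[Trajectory] = None
    for index, frame in enumerate(frames):
        features = shared_encoder(frame)  # both planners read the same features
        plans.append(fast_e2e_plan(features, prior))
        if index % vlm_period == 0:
            # Assigned after the fast plan, so it is only visible from the next frame on.
            prior = slow_vlm_plan(features)
    return plans
```

Calling `run_decoupled([[2.0, 2.0]] * 3)` returns one trajectory per frame. The first is the fast planner's unaided estimate, and the later ones are pulled toward the cached slow-planner output. A real system would replace the averaging with learned components. The sketch only shows the scheduling pattern.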

