RO CV, Feb 11

From Representational Complementarity to Dual Systems: Synergizing VLM and Vision-Only Backbones for End-to-End Driving

Sining Ang, Yuguang Yang, Chenxu Dang, Canyu Chen, Cheng Chi, Haiyan Liu, Xuanyao Mao, Jason Bao, Xuliang, Bingchuan Sun, Yan Wang

arXiv:2602.10719v14.01 citations, h-index: 1

Originality: Incremental advance

AI Analysis

This work aims to improve end-to-end autonomous driving planners by combining VLM and vision-only backbones; it is an incremental improvement over existing methods.

The paper analyzes the complementary behavior of vision-only (ViT) and vision-language model (VLM) backbones in end-to-end driving and finds that each decisively wins on about 2-3% of test scenarios. HybridDriveVLA runs both backbones and selects between their trajectories with a learned scorer, reaching 92.10 PDMS. DualDriveVLA uses a fast-slow policy that calls the VLM on 15% of scenarios, reaching 91.00 PDMS with a 3.2x throughput improvement.

Vision-Language-Action (VLA) driving augments end-to-end (E2E) planning with language-enabled backbones, yet it remains unclear what changes beyond the usual accuracy-cost trade-off. We revisit this question with a three-RQ analysis in RecogDrive, instantiating the system with a full VLM backbone and with vision-only backbones, all under an identical diffusion Transformer planner. RQ1: At the backbone level, the VLM can introduce additional subspaces beyond those of the vision-only backbones. RQ2: These unique subspaces lead to different behavior in some long-tail scenarios: the VLM tends to be more aggressive, whereas the ViT is more conservative, and each decisively wins on about 2-3% of test scenarios. With an oracle that selects, per scenario, the better trajectory between the VLM and ViT branches, we obtain an upper bound of 93.58 PDMS. RQ3: To fully harness this observation, we propose HybridDriveVLA, which runs both the ViT and VLM branches and selects between their endpoint trajectories using a learned scorer, improving PDMS to 92.10. Finally, DualDriveVLA implements a practical fast-slow policy: it runs the ViT by default and invokes the VLM only when the scorer's confidence falls below a threshold. Calling the VLM on 15% of scenarios achieves 91.00 PDMS while improving throughput by 3.2x. Code will be released.
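
The per-scenario oracle and the confidence-gated fast-slow policy described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `oracle_mean_score` and `FastSlowPlanner`, the callable signatures, and the toy numbers are all hypothetical. The sketch assumes that the benchmark score is a plain mean of per-scenario scores and that, once the slow branch is invoked, its trajectory is returned directly; the abstract does not specify either detail.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Tuple

# Hypothetical trajectory type: a sequence of (x, y) waypoints.
Trajectory = Sequence[Tuple[float, float]]


def oracle_mean_score(vit_scores: Sequence[float],
                      vlm_scores: Sequence[float]) -> float:
    """Upper bound from a per-scenario oracle.

    For each scenario the oracle keeps the better of the two branch scores.
    Assumption: the aggregate metric is a simple mean over scenarios.
    """
    if len(vit_scores) != len(vlm_scores) or not vit_scores:
        raise ValueError("need two non-empty score lists of equal length")
    best = [max(a, b) for a, b in zip(vit_scores, vlm_scores)]
    return sum(best) / len(best)


@dataclass
class FastSlowPlanner:
    """Confidence-gated dispatch between a fast and a slow planner branch.

    fast_plan / slow_plan stand in for the ViT and VLM branches;
    confidence stands in for the learned scorer (all hypothetical stubs).
    """
    fast_plan: Callable[[Any], Trajectory]
    slow_plan: Callable[[Any], Trajectory]
    confidence: Callable[[Any, Trajectory], float]
    threshold: float
    slow_calls: int = 0
    total_calls: int = 0

    def plan(self, observation: Any) -> Trajectory:
        self.total_calls += 1
        fast_traj = self.fast_plan(observation)
        # Keep the fast result unless confidence falls below the threshold.
        if self.confidence(observation, fast_traj) >= self.threshold:
            return fast_traj
        # Assumption: the slow branch's trajectory replaces the fast one.
        self.slow_calls += 1
        return self.slow_plan(observation)

    def slow_call_rate(self) -> float:
        """Fraction of scenarios that fell back to the slow branch."""
        return self.slow_calls / self.total_calls if self.total_calls else 0.0
```

In a sketch like this, the threshold is the knob that trades accuracy for throughput: raising it sends more scenarios to the slow branch, and `slow_call_rate()` reports the resulting fraction, which corresponds to the 15% figure quoted in the abstract.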
