VLA-R: Vision-Language Action Retrieval toward Open-World End-to-End Autonomous Driving
This work addresses open-world generalization for autonomous driving systems, a capability that is crucial for real-world deployment yet is often addressed only incrementally.
The authors tackle end-to-end autonomous driving in unstructured outdoor environments by introducing VLA-R, a framework that integrates open-world perception with vision-action retrieval. On a real-world robotic platform, it demonstrates strong generalization and exploratory performance in unseen environments with limited data.
Exploring open-world situations in an end-to-end manner is a promising yet challenging task due to the need for strong generalization capabilities. In particular, end-to-end autonomous driving in unstructured outdoor environments often encounters conditions not seen during training. In this work, we present Vision-Language Action Retrieval (VLA-R), an open-world end-to-end autonomous driving (OW-E2EAD) framework that integrates open-world perception with a novel vision-action retrieval paradigm. We leverage a frozen vision-language model for open-world detection and segmentation to obtain multi-scale, prompt-guided, and interpretable perception features without domain-specific tuning. A Q-Former bottleneck aggregates fine-grained visual representations with language-aligned visual features, bridging perception and action domains. To learn transferable driving behaviors, we introduce a vision-action contrastive learning scheme that aligns vision-language and action embeddings for effective open-world reasoning and action retrieval. Our experiments on a real-world robotic platform demonstrate strong generalization and exploratory performance in unstructured, unseen environments, even with limited data. Demo videos are provided in the supplementary material.
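To make the vision-action contrastive alignment and retrieval step more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a CLIP-style symmetric InfoNCE loss over paired embeddings and nearest-neighbor retrieval by cosine similarity. The abstract does not specify the loss form, the temperature, the embedding size, or the action representation, so the function names, the temperature of 0.07, and the (velocity, steering) action layout are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(vision_emb, action_emb, temperature=0.07):
    """Assumed loss: symmetric InfoNCE over a batch of paired embeddings.

    Row i of vision_emb is treated as the positive for row i of action_emb;
    every other row in the batch acts as a negative.
    """
    v = l2_normalize(vision_emb)
    a = l2_normalize(action_emb)
    logits = v @ a.T / temperature  # (batch, batch) scaled cosine similarities
    idx = np.arange(len(v))
    loss_v2a = -log_softmax(logits, axis=1)[idx, idx].mean()  # vision -> action
    loss_a2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # action -> vision
    return 0.5 * (loss_v2a + loss_a2v)


def retrieve_action(query_emb, action_bank_emb, actions):
    """Return the candidate action whose embedding is most similar to the query.

    actions[k] is the (hypothetical) control command paired with
    action_bank_emb[k], for example a (velocity, steering) vector.
    """
    sims = l2_normalize(action_bank_emb) @ l2_normalize(query_emb)
    best = int(np.argmax(sims))
    return actions[best], float(sims[best])
```

In this sketch, training would minimize `symmetric_info_nce` over batches of paired vision-language and action embeddings, and inference would call `retrieve_action` with the current vision-language embedding against a bank of pre-embedded candidate actions. Retrieval over a fixed bank keeps the output within a known set of behaviors, which is one plausible reading of "action retrieval" in the abstract.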