DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
For end-to-end autonomous driving, this work improves long-horizon world modeling and reasoning in challenging scenarios, though it is an incremental improvement over existing VLM-based methods.
DeepSight proposes a driving world model that predicts latent semantic features of future frames in BEV space, enabling long-horizon modeling, and adds adaptive text reasoning for long-tail scenarios, achieving SOTA results on the closed-loop Bench2Drive benchmark.
End-to-end autonomous driving systems increasingly integrate Vision-Language Model (VLM) architectures, using text or visual reasoning to make driving decisions more robust and accurate. However, the reasoning mechanisms in most methods are adapted directly from general domains and are not tailored to autonomous driving scenarios, particularly in their visual reasoning modules. In this paper, we propose a driving world model that predicts, in parallel, the latent semantic features of consecutive future frames in bird's-eye-view (BEV) space, thereby enabling long-horizon modeling of future world states. We also introduce an efficient, adaptive text reasoning mechanism that draws on additional social knowledge and reasoning capabilities to further improve driving performance in challenging long-tail scenarios. The resulting approach achieves state-of-the-art (SOTA) results on the closed-loop Bench2Drive benchmark. Code is available at: https://github.com/hotdogcheesewhite/DeepSight.
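
To make "parallel prediction of latent features for consecutive future frames" concrete, the following is a minimal NumPy sketch and not the DeepSight implementation. It assumes one learned embedding per future frame (`frame_queries`), a single shared projection (`w_out`), and a mean-squared-error objective against target latents; all of these names, shapes, and design choices are hypothetical and are not taken from the paper or its repository. The only point illustrated is that all K future BEV latents come out of one broadcasted pass instead of an autoregressive frame-by-frame rollout.

```python
import numpy as np


def predict_future_latents(bev, frame_queries, w_out):
    """Predict K future latent BEV maps in a single pass (illustrative only).

    bev:           (H, W, C) current BEV feature map
    frame_queries: (K, C)    hypothetical embedding, one per future frame
    w_out:         (C, C)    hypothetical shared projection
    returns:       (K, H, W, C) latent BEV map for each future frame
    """
    # Broadcast (1, H, W, C) + (K, 1, 1, C) -> (K, H, W, C):
    # every future frame is conditioned on the same current BEV state.
    conditioned = bev[None, :, :, :] + frame_queries[:, None, None, :]
    # One matmul over the channel axis handles all K frames at once.
    return np.tanh(conditioned @ w_out)


def latent_mse(pred, target):
    """Mean squared error between predicted and target latents (assumed loss)."""
    return float(np.mean((pred - target) ** 2))


# Toy dimensions and random stand-ins for learned weights and targets.
rng = np.random.default_rng(0)
H, W, C, K = 4, 4, 8, 6
bev = rng.standard_normal((H, W, C))
queries = rng.standard_normal((K, C))
w_out = rng.standard_normal((C, C)) / np.sqrt(C)

future = predict_future_latents(bev, queries, w_out)   # shape (K, H, W, C)
targets = rng.standard_normal((K, H, W, C))            # placeholder targets
loss = latent_mse(future, targets)
```

Because no frame depends on the previous frame's prediction in this sketch, errors do not compound across the horizon the way they can in an autoregressive rollout. Whether DeepSight conditions frames this way is not stated in the abstract.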