RO AI CL
Sep 24, 2025

Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving

Pengxiang Li, Yinan Zheng, Yue Wang, Huimin Wang, Hang Zhao, Jingjing Liu, Xianyuan Zhan, Kun Zhan, Xianpeng Lang

arXiv:2509.20109v1 | 16 citations | h-index: 21

Originality: Incremental advance

AI Analysis

This work provides a scalable and reliable solution for autonomous driving systems by enhancing safety in trajectory generation, though it is incremental as it builds on existing diffusion and VLA paradigms.

The paper addresses safe trajectory generation for autonomous driving. Existing Vision-Language-Action models depend on imitation learning and on costly remedies such as gradient-based diffusion guidance. The paper introduces ReflectDrive, which combines discrete diffusion with a safety-aware reflection mechanism and reports significant improvements in safety-critical trajectory generation on the NAVSIM benchmark.

End-to-End (E2E) solutions have emerged as a mainstream approach for autonomous driving systems, with Vision-Language-Action (VLA) models representing a new paradigm that leverages pre-trained multimodal knowledge from Vision-Language Models (VLMs) to interpret and interact with complex real-world environments. However, these methods remain constrained by the limitations of imitation learning, which struggles to inherently encode physical rules during training. Existing approaches often rely on complex rule-based post-refinement, employ reinforcement learning that remains largely limited to simulation, or utilize diffusion guidance that requires computationally expensive gradient calculations. To address these challenges, we introduce ReflectDrive, a novel learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion. We first discretize the two-dimensional driving space to construct an action codebook, enabling the use of pre-trained Diffusion Language Models for planning tasks through fine-tuning. Central to our approach is a safety-aware reflection mechanism that performs iterative self-correction without gradient computation. Our method begins with goal-conditioned trajectory generation to model multi-modal driving behaviors. Based on this, we apply local search methods to identify unsafe tokens and determine feasible solutions, which then serve as safe anchors for inpainting-based regeneration. Evaluated on the NAVSIM benchmark, ReflectDrive demonstrates significant advantages in safety-critical trajectory generation, offering a scalable and reliable solution for autonomous driving systems.
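The two mechanical steps named in the abstract, discretizing the 2D driving space into tokens and running a local search for a safe replacement token, can be illustrated with a minimal sketch. This is not the authors' implementation. The grid bounds, the bin count, the per-axis tokenization, the `is_safe` callback, and the search radius are all assumptions made for illustration, and the inpainting-based regeneration by the diffusion model is not shown.

```python
import numpy as np

# Hypothetical grid; the actual codebook range and resolution are not stated in the abstract.
LO = np.array([-10.0, -20.0])   # assumed (x_min, y_min) in metres
HI = np.array([50.0, 20.0])     # assumed (x_max, y_max) in metres
N_BINS = 128                    # assumed number of bins per axis


def encode(points):
    """Map (T, 2) metric waypoints to (T, 2) integer tokens, one token per coordinate."""
    norm = (np.asarray(points, dtype=float) - LO) / (HI - LO)
    return np.clip(np.floor(norm * N_BINS).astype(int), 0, N_BINS - 1)


def decode(tokens):
    """Map integer tokens back to the metric coordinates of their bin centres."""
    return LO + (np.asarray(tokens) + 0.5) / N_BINS * (HI - LO)


def local_search_anchor(tokens, is_safe, radius=3):
    """Find the first unsafe waypoint and the nearest safe token around it.

    Returns None if every waypoint is safe. Otherwise returns (index, anchor),
    where anchor is the closest safe token within `radius` bins, or None if
    no safe token exists in that neighbourhood.
    """
    for t, tok in enumerate(tokens):
        if is_safe(decode(tok)):
            continue
        best, best_d = None, None
        for dx in range(-radius, radius + 1):
            for dy in range(-radius, radius + 1):
                cand = np.clip(tok + np.array([dx, dy]), 0, N_BINS - 1)
                d = dx * dx + dy * dy
                if is_safe(decode(cand)) and (best is None or d < best_d):
                    best, best_d = cand, d
        return t, best
    return None


# Usage: a straight trajectory that passes through a circular obstacle at (5, 0).
trajectory = encode([[0.0, 0.0], [5.0, 0.0], [10.0, 0.0]])
is_safe = lambda p: np.linalg.norm(p - np.array([5.0, 0.0])) > 1.0
result = local_search_anchor(trajectory, is_safe)
```

In a full pipeline, the returned anchor would be held fixed while the surrounding tokens are masked and regenerated by the model. Because the search only evaluates a safety check on candidate tokens, it needs no gradient computation, which is the property the abstract emphasizes.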
