Dangen She

CV
h-index12
3papers
6citations
Novelty63%
AI Score46

3 Papers

ROMar 28
Uni-World VLA: Interleaved World Modeling and Planning for Autonomous Driving

Qiqi Liu, Huan Xu, Jingyu Li et al.

Autonomous driving requires reasoning about how the environment evolves and planning actions accordingly. Existing world-model-based approaches typically predict future scenes first and plan afterwards, resulting in open-loop imagination that may drift from the actual decision process. In this paper, we present Uni-World VLA, a unified vision-language-action (VLA) model that tightly interleaves future frame prediction and trajectory planning. Instead of generating a full world rollout before planning, our model alternates between predicting future frames and ego actions step by step, allowing planning decisions to be continuously conditioned on the imagined future observations. This interleaved generation forms a closed-loop interaction between world modeling and control, enabling more adaptive decision-making in dynamic traffic scenarios. In addition, we incorporate monocular depth information into frames to provide stronger geometric cues for world modeling, improving long-horizon scene prediction. Experiments on the NAVSIM benchmark show that our approach achieves competitive closed-loop planning performance while producing high-fidelity future frame predictions. These results demonstrate that tightly coupling world prediction and planning is a promising direction for scalable VLA driving systems.

CVAug 31, 2025
OmniReason: A Temporal-Guided Vision-Language-Action Framework for Autonomous Driving

Pei Liu, Qingtian Ning, Xinyan Lu et al.

Recent advances in vision-language models (VLMs) have demonstrated impressive spatial reasoning capabilities for autonomous driving, yet existing methods predominantly focus on static scene understanding while neglecting the essential temporal dimension of real-world driving scenarios. To address this critical limitation, we propose the OmniReason framework, which establishes robust spatiotemporal reasoning by jointly modeling dynamic 3D environments and their underlying decision-making processes. Our work makes two fundamental advances: (1) We introduce OmniReason-Data, two large-scale vision-language-action (VLA) datasets with dense spatiotemporal annotations and natural language explanations, generated through a novel hallucination-mitigated auto-labeling pipeline that ensures both physical plausibility and temporal coherence; (2) We develop the OmniReason-Agent architecture, which integrates a sparse temporal memory module for persistent scene context modeling and an explanation generator that produces human-interpretable decision rationales, facilitated by our spatiotemporal knowledge distillation approach that effectively captures spatiotemporal causal reasoning patterns. Comprehensive experiments demonstrate state-of-the-art performance, where OmniReason-Agent achieves significant improvements in both open-loop planning tasks and visual question answering (VQA) benchmarks, while establishing new capabilities for interpretable, temporally-aware autonomous vehicles operating in complex, dynamic environments.

CVJul 31, 2025
MagicRoad: Semantic-Aware 3D Road Surface Reconstruction via Obstacle Inpainting

Xingyue Peng, Yuandong Lyu, Lang Zhang et al.

Road surface reconstruction is essential for autonomous driving, supporting centimeter-accurate lane perception and high-definition mapping in complex urban environments.While recent methods based on mesh rendering or 3D Gaussian splatting (3DGS) achieve promising results under clean and static conditions, they remain vulnerable to occlusions from dynamic agents, visual clutter from static obstacles, and appearance degradation caused by lighting and weather changes. We present a robust reconstruction framework that integrates occlusion-aware 2D Gaussian surfels with semantic-guided color enhancement to recover clean, consistent road surfaces. Our method leverages a planar-adapted Gaussian representation for efficient large-scale modeling, employs segmentation-guided video inpainting to remove both dynamic and static foreground objects, and enhances color coherence via semantic-aware correction in HSV space. Extensive experiments on urban-scale datasets demonstrate that our framework produces visually coherent and geometrically faithful reconstructions, significantly outperforming prior methods under real-world conditions.