Latent-WAM: Latent World Action Modeling for End-to-End Autonomous Driving
Latent-WAM addresses the challenge of sub-optimal planning under constrained data and compute budgets in autonomous driving systems; it represents a strong, specific gain rather than a foundational breakthrough.
The paper tackles the problem of inefficient world-model-based planners in autonomous driving by introducing Latent-WAM, an end-to-end framework that achieves state-of-the-art trajectory planning with 89.3 EPDMS on NAVSIM v2 and 28.9 HD-Score on HUGSIM, using less training data and a compact 104M-parameter model.
We introduce Latent-WAM, an efficient end-to-end autonomous driving framework that achieves strong trajectory planning through spatially-aware and dynamics-informed latent world representations. Existing world-model-based planners suffer from inadequately compressed representations, limited spatial understanding, and underutilized temporal dynamics, resulting in sub-optimal planning under constrained data and compute budgets. Latent-WAM addresses these limitations with two core modules: a Spatial-Aware Compressive World Encoder (SCWE) that distills geometric knowledge from a foundation model and compresses multi-view images into compact scene tokens via learnable queries, and a Dynamic Latent World Model (DLWM) that employs a causal Transformer to autoregressively predict future world status conditioned on historical visual and motion representations. Extensive experiments on NAVSIM v2 and HUGSIM demonstrate new state-of-the-art results: 89.3 EPDMS on NAVSIM v2 and 28.9 HD-Score on HUGSIM, surpassing the best prior perception-free method by 3.2 EPDMS with significantly less training data and a compact 104M-parameter model.
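To make the two mechanisms named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes single-head attention without learned projections, random values in place of trained weights, and a frame-level causal mask. The function names (`compress_with_queries`, `causal_self_attention`) and all sizes are hypothetical. Geometric distillation from the foundation model, motion representations, and the trajectory planning head are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) receive zero weight.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_with_queries(patch_tokens, queries):
    """Cross-attention: each query attends over all multi-view patch tokens.

    patch_tokens: (num_patches, dim), queries: (num_queries, dim)
    returns: (num_queries, dim) compact scene tokens
    """
    dim = queries.shape[-1]
    scores = queries @ patch_tokens.T / np.sqrt(dim)
    return softmax(scores, axis=-1) @ patch_tokens

def causal_self_attention(tokens, frame_ids):
    """Self-attention in which a token sees only its own or earlier frames."""
    dim = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(dim)
    allowed = frame_ids[None, :] <= frame_ids[:, None]
    scores = np.where(allowed, scores, -np.inf)
    return softmax(scores, axis=-1) @ tokens

# Illustrative sizes only (assumptions, not values from the paper).
rng = np.random.default_rng(0)
num_views, patches_per_view, dim = 3, 16, 8
num_queries, num_frames = 4, 3

# Random stand-in for the learnable queries (learned during training in practice).
queries = rng.normal(size=(num_queries, dim))

# Compress each frame's multi-view patches into a few scene tokens.
scene_tokens = []
for _ in range(num_frames):
    patches = rng.normal(size=(num_views * patches_per_view, dim))
    scene_tokens.append(compress_with_queries(patches, queries))

# Stack the history and mix it causally across frames.
history = np.concatenate(scene_tokens)  # (num_frames * num_queries, dim)
frame_ids = np.repeat(np.arange(num_frames), num_queries)
context = causal_self_attention(history, frame_ids)
```

In this sketch each frame's 48 patch tokens are reduced to 4 scene tokens, so the causal model operates on a sequence of 12 tokens instead of 144. The mask guarantees that the representation of an earlier frame does not depend on later frames, which is the property an autoregressive predictor of future world status relies on.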