RO · Jun 4

Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning

arXiv:2606.05645 | 82.3
AI Analysis

For autonomous driving researchers, this work provides a principled framework for action-conditioned world modeling and decision-making, though it is incremental over existing discrete diffusion approaches.

Discrete-WAM introduces a unified discrete token representation for vision and actions, enabling compositional causal reasoning in autonomous driving. It achieves competitive performance on large-scale benchmarks while supporting controllable generation and counterfactual reasoning.

Autonomous driving requires reasoning about how ego actions shape the evolution of the surrounding world. However, most end-to-end methods rely on direct state-to-action mappings, capturing correlations without explicitly modeling action-conditioned dynamics. Conversely, continuous-latent world models often lack compositional structure for causal reasoning across counterfactual futures. We introduce Discrete-WAM, a unified latent vision-action world policy that represents future visual states and ego actions as aligned discrete tokens, enabling compositional causal reasoning across alternative futures. Built upon this unified discrete alignment, Discrete-WAM establishes a shared discrete diffusion framework with unified generative tasks, jointly formulating world modeling, world-action policy, and hierarchical decision-enabled policy, supporting compositional generalization across diverse driving scenarios. Experiments on large-scale autonomous-driving benchmarks show that Discrete-WAM achieves competitive performance while supporting controllable generation and counterfactual reasoning, offering a principled path toward more reliable decision-making.
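As a rough illustration of how several generative tasks can share one discrete token sequence, the sketch below treats each task as a different choice of which segments to corrupt and regenerate. It is not the paper's implementation. The segment names, the `MASK` id, the `TASK_TARGETS` table, and the use of absorbing-state masking (one common form of discrete diffusion) are all assumptions made for illustration, and the hierarchical decision-enabled policy is left out because the abstract does not describe its structure.

```python
import random

MASK = -1  # hypothetical id for the absorbing "masked" state


def build_sequence(cur_vision, fut_vision, actions):
    """Concatenate token ids and record which segment each position belongs to."""
    tokens = list(cur_vision) + list(actions) + list(fut_vision)
    segments = (
        ["cur_vision"] * len(cur_vision)
        + ["action"] * len(actions)
        + ["fut_vision"] * len(fut_vision)
    )
    return tokens, segments


# Segments each task must generate; all other segments act as conditioning.
# These task definitions are assumptions, not taken from the paper.
TASK_TARGETS = {
    "world_model": {"fut_vision"},             # actions given, predict future frames
    "world_policy": {"fut_vision", "action"},  # predict future frames and actions jointly
}


def corrupt(tokens, segments, task, t, rng):
    """Replace each target token with MASK independently with probability t."""
    targets = TASK_TARGETS[task]
    noisy = []
    for tok, seg in zip(tokens, segments):
        if seg in targets and rng.random() < t:
            noisy.append(MASK)
        else:
            noisy.append(tok)
    return noisy


if __name__ == "__main__":
    toks, segs = build_sequence(cur_vision=[1, 2], fut_vision=[5, 6], actions=[3])
    rng = random.Random(0)
    print(corrupt(toks, segs, "world_model", 1.0, rng))
    print(corrupt(toks, segs, "world_policy", 1.0, rng))
```

At `t = 1.0` every target token is masked, so the two calls differ only in whether the action token is kept as conditioning or hidden for the model to generate. A denoiser trained to recover the masked positions could then, under these assumptions, serve both tasks with a single set of weights.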

