CV, Feb 24

RAYNOVA: Scale-Temporal Autoregressive World Modeling in Ray Space

Berkeley
arXiv:2602.20685v2, 3 citations, h-index: 41
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of building robust world models for autonomous driving that generalize across diverse camera setups without explicit 3D geometry. The advance is incremental: it combines scale-temporal autoregression with existing attention mechanisms.

The paper tackles the problem of simulating real-world evolution in driving scenarios by proposing RAYNOVA, a geometry-agnostic multiview world model that achieves state-of-the-art multi-view video generation results on nuScenes, with higher throughput and strong controllability.

World foundation models aim to simulate the evolution of the real world with physically plausible behavior. Unlike prior methods that handle spatial and temporal correlations separately, we propose RAYNOVA, a geometry-agnostic multiview world model for driving scenarios that employs a dual-causal autoregressive framework. It follows both scale-wise and temporal topological orders in the autoregressive process, and leverages global attention for unified 4D spatio-temporal reasoning. Different from existing works that impose strong 3D geometric priors, RAYNOVA constructs an isotropic spatio-temporal representation across views, frames, and scales based on relative Plücker-ray positional encoding, enabling robust generalization to diverse camera setups and ego motions. We further introduce a recurrent training paradigm to alleviate distribution drift in long-horizon video generation. RAYNOVA achieves state-of-the-art multi-view video generation results on nuScenes, while offering higher throughput and strong controllability under diverse input conditions, generalizing to novel views and camera configurations without explicit 3D scene representation. Our code will be released at https://raynova-ai.github.io/.
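
To make the Plücker-ray idea in the abstract concrete, here is a minimal sketch of the standard Plücker coordinates of a camera ray: a unit direction d together with the moment m = o × d, where o is the camera centre. It assumes a plain pinhole camera and a camera-to-world rotation. The function names (`cross`, `pixel_to_plucker`) and all parameters are hypothetical, chosen for illustration. The abstract does not specify how RAYNOVA turns these coordinates into a relative positional encoding, so that step is not shown.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def pixel_to_plucker(u, v, fx, fy, cx, cy, R, origin):
    """Return the 6-D Plücker coordinates (d, m) of the ray through pixel (u, v).

    Assumptions (not taken from the paper): pinhole intrinsics fx, fy, cx, cy;
    R is a 3x3 camera-to-world rotation as nested lists; origin is the camera
    centre in world coordinates. d is the unit direction and m = origin x d.
    """
    # Back-project the pixel to a direction in the camera frame.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate the direction into the world frame.
    d_world = tuple(sum(R[i][j] * d_cam[j] for j in range(3)) for i in range(3))
    norm = math.sqrt(sum(c * c for c in d_world))
    d = tuple(c / norm for c in d_world)
    # The moment is the same for every point on the ray, so (d, m) identifies
    # the ray itself rather than a particular 3D point on it.
    m = cross(origin, d)
    return d + m


# Ray through the principal point of a camera at (1, 0, 0) looking along +z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ray = pixel_to_plucker(320.0, 240.0, 500.0, 500.0, 320.0, 240.0, identity, (1.0, 0.0, 0.0))
```

Because the moment does not depend on which point along the ray is used, the six numbers describe the ray without any depth or explicit 3D scene representation, which is consistent with the geometry-agnostic framing above.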

