CausNVS: Autoregressive Multi-view Diffusion for Flexible 3D Novel View Synthesis
This work addresses flexible and efficient 3D novel view synthesis for applications such as world modeling, and is an incremental improvement over existing methods.
The paper addresses the limitations of non-autoregressive multi-view diffusion models in 3D novel view synthesis by proposing CausNVS, an autoregressive model that supports arbitrary input-output view configurations and generates views sequentially. It reports consistently strong visual quality across diverse settings.
Multi-view diffusion models have shown promise in 3D novel view synthesis, but most existing methods adopt a non-autoregressive formulation. This limits their applicability in world modeling: they support only a fixed number of views and suffer from slow inference because all frames are denoised simultaneously. To address these limitations, we propose CausNVS, a multi-view diffusion model in an autoregressive setting, which supports arbitrary input-output view configurations and generates views sequentially. We train CausNVS with causal masking and per-frame noise, using pairwise-relative camera pose encodings (CaPE) for precise camera control. At inference time, we combine a spatially aware sliding window with key-value caching and noise conditioning augmentation to mitigate drift. Our experiments demonstrate that CausNVS supports a broad range of camera trajectories, enables flexible autoregressive novel view synthesis, and achieves consistently strong visual quality across diverse settings. Project page: https://kxhit.github.io/CausNVS.html.
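
To make the training recipe above concrete, the following is a minimal sketch of frame-level causal masking and per-frame noise, written with NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function names `frame_causal_mask` and `add_per_frame_noise`, the uniform sampling of noise levels, and the linear mixing of signal and noise are all hypothetical choices made here for clarity. The abstract does not specify the noise schedule or the attention layout.

```python
import numpy as np


def frame_causal_mask(num_frames, tokens_per_frame):
    """Return a boolean attention mask of shape (N, N), N = num_frames * tokens_per_frame.

    Entry [q, k] is True when query token q may attend to key token k, i.e.
    when k belongs to the same frame as q or to an earlier frame.
    """
    # Frame index of every token, e.g. [0, 0, 1, 1, 2, 2] for 3 frames of 2 tokens.
    frame_ids = np.repeat(np.arange(num_frames), tokens_per_frame)
    return frame_ids[None, :] <= frame_ids[:, None]


def add_per_frame_noise(frames, rng):
    """Corrupt each frame with its own independently sampled noise level.

    frames: array of shape (F, ...), one entry per view.
    Returns (noisy_frames, levels), where levels has shape (F,).
    Assumption: levels are uniform in [0, 1) and mixing is linear; the paper's
    actual schedule may differ.
    """
    num_frames = frames.shape[0]
    levels = rng.uniform(0.0, 1.0, size=num_frames)
    # Reshape so each level broadcasts over its frame's remaining dimensions.
    levels_b = levels.reshape((num_frames,) + (1,) * (frames.ndim - 1))
    noise = rng.standard_normal(frames.shape)
    noisy = (1.0 - levels_b) * frames + levels_b * noise
    return noisy, levels


if __name__ == "__main__":
    mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
    print(mask.astype(int))

    rng = np.random.default_rng(0)
    views = np.ones((3, 4, 4, 3))  # 3 toy views of size 4x4 with 3 channels
    noisy_views, levels = add_per_frame_noise(views, rng)
    print(noisy_views.shape, levels.shape)
```

Because the mask is block lower-triangular at the frame level, tokens within one view attend to each other freely while never seeing later views. This property is what allows keys and values of already generated views to be cached and reused at inference time.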