CV · Dec 23, 2025

UMAMI: Unifying Masked Autoregressive Models and Deterministic Rendering for View Synthesis

arXiv:2512.20107v1 · h-index: 13
Originality: Incremental advance
AI Analysis

This paper addresses efficient, high-quality novel view synthesis. It is an incremental advance that combines two existing paradigms rather than introducing a new one.

The paper tackles novel view synthesis with a hybrid framework that unifies deterministic rendering and masked autoregressive models. It reports state-of-the-art image quality and an order-of-magnitude reduction in rendering time compared with fully generative baselines.

Novel view synthesis (NVS) seeks to render photorealistic, 3D-consistent images of a scene from unseen camera poses given only a sparse set of posed views. Existing deterministic networks render observed regions quickly but blur unobserved areas, whereas stochastic diffusion-based methods hallucinate plausible content yet incur heavy training- and inference-time costs. In this paper, we propose a hybrid framework that unifies the strengths of both paradigms. A bidirectional transformer encodes multi-view image tokens and Plücker-ray embeddings, producing a shared latent representation. Two lightweight heads then act on this representation: (i) a feed-forward regression head that renders pixels where geometry is well constrained, and (ii) a masked autoregressive diffusion head that completes occluded or unseen regions. The entire model is trained end-to-end with joint photometric and diffusion losses, without handcrafted 3D inductive biases, enabling scalability across diverse scenes. Experiments demonstrate that our method attains state-of-the-art image quality while reducing rendering time by an order of magnitude compared with fully generative baselines.
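
The abstract mentions Plücker-ray embeddings as the camera-pose input to the transformer but does not spell them out. As a rough illustration only, the sketch below computes the standard 6-D Plücker coordinates of a pixel ray, a unit direction d and a moment o × d, for a pinhole camera. The function names, the (d, o × d) ordering, the pixel-center offset, and the camera-to-world rotation convention are all assumptions made for this example, not details taken from the paper.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def plucker_ray(origin, direction):
    """Return the 6-D Plücker coordinates (d, o x d) of a ray.

    The ordering (direction first, moment second) is an assumption;
    some codebases use (o x d, d) instead.
    """
    d = normalize(direction)
    m = cross(origin, d)  # moment: independent of where o sits on the ray
    return d + m          # tuple concatenation gives a 6-tuple

def pixel_ray_direction(u, v, fx, fy, cx, cy, R):
    """World-space direction of the ray through the center of pixel (u, v).

    Assumes a pinhole camera looking down +z and a camera-to-world
    rotation R given as three rows of three floats (hypothetical convention).
    """
    d_cam = ((u + 0.5 - cx) / fx, (v + 0.5 - cy) / fy, 1.0)
    return tuple(sum(R[i][j] * d_cam[j] for j in range(3)) for i in range(3))

def plucker_image(width, height, fx, fy, cx, cy, R, origin):
    """Per-pixel Plücker embedding, indexed as result[v][u]."""
    return [[plucker_ray(origin, pixel_ray_direction(u, v, fx, fy, cx, cy, R))
             for u in range(width)]
            for v in range(height)]
```

Because the moment o × d does not change when the origin slides along the ray, the six numbers identify the ray itself rather than a particular camera center. In a model of the kind the abstract describes, such per-pixel embeddings would typically be grouped per patch and combined with the image tokens; how UMAMI does this is not stated in the abstract.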

