CV · Dec 27, 2025

Visual Autoregressive Modelling for Monocular Depth Estimation

arXiv:2512.22653v1 · 2 citations · h-index: 58 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses monocular depth estimation for 3D vision tasks. It offers an approach complementary to diffusion-based methods, with advantages in data scalability and adaptability, though the advance is incremental because it adapts existing VAR models.

The authors tackle monocular depth estimation with a visual autoregressive (VAR) method as an alternative to diffusion-based approaches. Fine-tuned on only 74K synthetic samples, the method achieves state-of-the-art performance on indoor benchmarks under constrained training and strong results on outdoor datasets.

We propose a monocular depth estimation method based on visual autoregressive (VAR) priors, offering an alternative to diffusion-based approaches. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism with classifier-free guidance. Our approach performs inference in ten fixed autoregressive stages, requires only 74K synthetic samples for fine-tuning, and achieves competitive results. We report state-of-the-art performance on indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets. This work establishes autoregressive priors as a complementary family of geometry-aware generative models for depth estimation, highlighting advantages in data scalability and adaptability to 3D vision tasks. Code is available at https://github.com/AmirMaEl/VAR-Depth.
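
To make the abstract's "ten fixed autoregressive stages" and "classifier-free guidance" more concrete, here is a minimal toy sketch of coarse-to-fine token prediction with guidance applied at every scale. It is not the authors' implementation. The scale schedule `SCALES`, the stand-in model `toy_logits`, the vocabulary size, greedy decoding, and nearest-neighbour upsampling are all assumptions made for illustration. The guidance rule itself is the common formulation `uncond + w * (cond - uncond)`; the paper may use a different variant.

```python
# Toy sketch of scale-wise autoregressive decoding with classifier-free guidance.
# Every name below is hypothetical; this does not reproduce the paper's model.

# Assumed schedule of ten token-map side lengths, coarse to fine.
SCALES = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)


def cfg_logits(cond, uncond, scale):
    """Classifier-free guidance: move from unconditional toward conditional logits."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def upsample_nearest(grid, size):
    """Nearest-neighbour upsampling of a square token map to size x size."""
    src = len(grid)
    return [[grid[r * src // size][c * src // size] for c in range(size)]
            for r in range(size)]


def toy_logits(row, col, prev_token, vocab_size, conditioned):
    """Hypothetical stand-in for a VAR transformer head at one token position."""
    logits = [0.0] * vocab_size
    logits[prev_token] += 1.0  # weak prior carried over from the coarser scale
    if conditioned:
        # Pretend the conditioning image favours this token at this position.
        logits[(row + col) % vocab_size] += 2.0
    return logits


def generate(scales=SCALES, vocab_size=4, guidance=3.0):
    """Predict one token map per scale, each conditioned on the upsampled previous map."""
    grid = [[0]]  # start token
    for size in scales:
        upsampled = upsample_nearest(grid, size)
        next_grid = []
        # A real VAR model predicts all tokens of a scale in parallel;
        # the explicit loops here are only for readability.
        for r in range(size):
            row_tokens = []
            for c in range(size):
                cond = toy_logits(r, c, upsampled[r][c], vocab_size, True)
                uncond = toy_logits(r, c, upsampled[r][c], vocab_size, False)
                guided = cfg_logits(cond, uncond, guidance)
                # Greedy decoding (an assumption; sampling is also common).
                row_tokens.append(max(range(vocab_size), key=guided.__getitem__))
            next_grid.append(row_tokens)
        grid = next_grid
    return grid


if __name__ == "__main__":
    tokens = generate()
    print(len(SCALES), "stages, final map", len(tokens), "x", len(tokens[0]))
```

With a guidance weight of 1 the rule returns the conditional logits unchanged, and with 0 it returns the unconditional ones; larger weights push predictions further toward the conditioning signal. In a depth model the final token map would then be decoded into a depth image, a step this sketch omits.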

