CV · Nov 13, 2025

Depth Anything 3: Recovering the Visual Space from Any Views

arXiv:2511.10647v1 · 292 citations · h-index: 23
Originality: Incremental advance
AI Analysis

This work addresses the recovery of visual geometry from any number of views, a core computer vision problem, and is rated an incremental advance over previous models such as Depth Anything 2.

The paper tackles the problem of predicting spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. On the authors' new benchmark it reports state-of-the-art results, surpassing the prior SOTA, VGGT, by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy.

We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
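
The abstract names a depth-ray prediction target but does not define it. As a minimal sketch, assuming each pixel carries a depth value plus a ray given by an origin and a direction, a 3D point could be recovered as origin + depth * direction. The function name `unproject`, the nested-list layout, and the convention that depth scales an unnormalized direction are all illustrative assumptions, not the paper's implementation.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def unproject(
    depth: List[List[float]],
    origins: List[List[Vec3]],
    directions: List[List[Vec3]],
) -> List[List[Vec3]]:
    """Turn per-pixel depth and rays into 3D points.

    Assumed convention (hypothetical): point = origin + depth * direction,
    with the direction left unnormalized.
    """
    points = []
    for depth_row, origin_row, direction_row in zip(depth, origins, directions):
        row = []
        for d, o, r in zip(depth_row, origin_row, direction_row):
            # Move d units along the ray, starting from its origin.
            row.append(tuple(o_i + d * r_i for o_i, r_i in zip(o, r)))
        points.append(row)
    return points


# Toy 1x2 "image": a camera at the world origin looking down +z.
toy_depth = [[2.0, 3.0]]
toy_origins = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
toy_directions = [[(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]]
toy_points = unproject(toy_depth, toy_origins, toy_directions)
```

Under this assumed convention, points from several views land in one shared coordinate frame whenever their rays are expressed in that frame, which is one way to read "spatially consistent geometry" without a separate pose output.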

