CV · Nov 26, 2024

DepthCues: Evaluating Monocular Depth Perception in Large Vision Models

arXiv:2411.17385v3 · 12 citations · h-index: 32 · CVPR
Originality: Incremental advance
AI Analysis

This work addresses how geometric properties emerge in vision models, a question relevant to researchers in computer vision and AI. It is an incremental advance that builds on existing studies of model capabilities.

The paper investigates how monocular depth perception emerges in large pre-trained vision models without explicit depth supervision. It introduces the DepthCues benchmark to evaluate depth cue understanding across 20 models and finds that human-like depth cues appear in more recent, larger models. Fine-tuning on DepthCues improves depth estimation even without dense depth supervision.

Large-scale pre-trained vision models are becoming increasingly prevalent, offering expressive and generalizable visual representations that benefit various downstream tasks. Recent studies on the emergent properties of these models have revealed their high-level geometric understanding, in particular in the context of depth perception. However, it remains unclear how depth perception arises in these models without explicit depth supervision provided during pre-training. To investigate this, we examine whether the monocular depth cues, similar to those used by the human visual system, emerge in these models. We introduce a new benchmark, DepthCues, designed to evaluate depth cue understanding, and present findings across 20 diverse and representative pre-trained vision models. Our analysis shows that human-like depth cues emerge in more recent larger models. We also explore enhancing depth perception in large vision models by fine-tuning on DepthCues, and find that even without dense depth supervision, this improves depth estimation. To support further research, our benchmark and evaluation code will be made publicly available for studying depth perception in vision models.
