CV · Dec 4, 2024

Align3R: Aligned Monocular Depth Estimation for Dynamic Videos

arXiv:2412.03079v2 · 83 citations · h-index: 17 · CVPR
AI Analysis

This paper addresses the challenge of temporal consistency in video depth estimation, which matters for applications such as robotics and AR, but the contribution is incremental because it builds on existing models such as DUSt3R.

The paper tackles inconsistent depth estimation across the frames of a monocular video by proposing Align3R, which aligns per-frame depth maps using a fine-tuned DUSt3R model followed by optimization, and reports better performance than baseline methods.

Recent developments in monocular depth estimation enable high-quality depth estimation from single-view images, but these methods fail to estimate consistent video depth across frames. Recent works address this problem by applying a video diffusion model to generate video depth conditioned on the input video, which is expensive to train and can only produce scale-invariant depth values without camera poses. In this paper, we propose a novel video depth estimation method called Align3R to estimate temporally consistent depth maps for a dynamic video. Our key idea is to use the recent DUSt3R model to align estimated monocular depth maps from different timesteps. First, we fine-tune the DUSt3R model on dynamic scenes, with estimated monocular depth as an additional input. Then, we apply optimization to reconstruct both depth maps and camera poses. Extensive experiments demonstrate that Align3R estimates consistent video depth and camera poses for a monocular video, outperforming baseline methods.
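To make the alignment problem concrete, the sketch below shows why per-frame, scale-invariant depth is inconsistent across a video and how a simple least-squares scale-and-shift fit can bring one frame's depth into another frame's units. This is not the Align3R method, which relies on a fine-tuned DUSt3R model and a subsequent optimization. It is only a minimal illustration under strong assumptions: both depth maps are flattened lists covering the same static pixels, and the names `fit_scale_shift` and `apply_scale_shift` are hypothetical helpers introduced here.

```python
from typing import List, Sequence, Tuple


def fit_scale_shift(pred: Sequence[float], ref: Sequence[float]) -> Tuple[float, float]:
    """Return (scale, shift) minimizing sum((scale * pred_i + shift - ref_i) ** 2).

    Assumption: pred and ref hold depth values for the same pixels, e.g. static
    scene points seen in both frames. This is a generic closed-form fit, not
    the procedure used in the paper.
    """
    if len(pred) != len(ref) or len(pred) < 2:
        raise ValueError("need two equally sized sequences with at least 2 values")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_r = sum(ref) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0.0:
        raise ValueError("pred is constant, so the scale cannot be identified")
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    scale = cov / var_p
    shift = mean_r - scale * mean_p
    return scale, shift


def apply_scale_shift(depth: Sequence[float], scale: float, shift: float) -> List[float]:
    """Map a depth map into the reference frame's units."""
    return [scale * d + shift for d in depth]


if __name__ == "__main__":
    # Reference-frame depth and a second frame that sees the same points
    # but reports them with an unknown scale and shift.
    ref_depth = [1.0, 2.0, 3.0, 4.0]
    other_depth = [0.25, 0.75, 1.25, 1.75]

    s, t = fit_scale_shift(other_depth, ref_depth)
    print("scale:", s, "shift:", t)
    print("aligned:", apply_scale_shift(other_depth, s, t))
```

A per-frame fit like this needs known pixel correspondences and handles neither moving objects nor camera motion, which is the gap that a learned pairwise model and joint optimization over depth and poses are meant to fill.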

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
