CVJun 3, 2024

Learning Temporally Consistent Video Depth from Video Diffusion Priors

arXiv:2406.01493v4102 citations
Originality Incremental advance
AI Analysis

This work addresses the challenge of streamed video depth estimation for applications requiring cross-frame consistency, representing an incremental improvement with a novel training and inference strategy.

The paper tackles the problem of achieving temporally consistent depth estimation in videos, proposing a method that reformulates depth prediction as a conditional generation problem to share contextual information across frames and clips, resulting in superior performance validated through extensive experiments.

This work addresses the challenge of streamed video depth estimation, which expects not only per-frame accuracy but, more importantly, cross-frame consistency. We argue that sharing contextual information between frames or clips is pivotal in fostering temporal consistency. Therefore, we reformulate depth prediction into a conditional generation problem to provide contextual information within a clip and across clips. Specifically, we propose a consistent context-aware training and inference strategy for arbitrarily long videos to provide cross-clip context. We sample independent noise levels for each frame within a clip during training while using a sliding window strategy and initializing overlapping frames with previously predicted frames without adding noise. Moreover, we design an effective training strategy to provide context within a clip. Extensive experimental results validate our design choices and demonstrate the superiority of our approach, dubbed ChronoDepth. Project page: https://xdimlab.github.io/ChronoDepth/.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes