CV | Apr 20, 2025

Seurat: From Moving Points to Depth

arXiv:2504.14687v1 | 13 citations | h-index: 13 | Has Code | CVPR
Originality: Highly original
AI Analysis

The paper addresses depth estimation from monocular video, a known bottleneck in computer vision, with a novel method that reads relative depth from the motion of tracked 2D points.

The method takes 2D trajectories from an off-the-shelf point tracker and processes them with spatial and temporal transformers to infer relative depth. It achieves robust zero-shot performance on the TAPVid-3D benchmark, generalizing from synthetic to real-world data with temporally smooth, high-accuracy predictions.

Accurate depth estimation from monocular videos remains challenging due to ambiguities inherent in single-view geometry, as crucial depth cues like stereopsis are absent. However, humans often perceive relative depth intuitively by observing variations in the size and spacing of objects as they move. Inspired by this, we propose a novel method that infers relative depth by examining the spatial relationships and temporal evolution of a set of tracked 2D trajectories. Specifically, we use off-the-shelf point tracking models to capture 2D trajectories. Then, our approach employs spatial and temporal transformers to process these trajectories and directly infer depth changes over time. Evaluated on the TAPVid-3D benchmark, our method demonstrates robust zero-shot performance, generalizing effectively from synthetic to real-world datasets. Results indicate that our approach achieves temporally smooth, high-accuracy depth predictions across diverse domains.
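The geometric intuition behind the abstract (apparent spacing between tracked points shrinks as they recede) can be illustrated with a small worked example. This is a minimal sketch and not the paper's transformer-based method: it assumes an ideal pinhole camera and a rigid, fronto-parallel set of points moving along the optical axis, in which case image spacing is proportional to 1/Z and the ratio of spacings gives the relative depth. The helper names `project`, `pairwise_spread`, and `relative_depth_from_tracks` are hypothetical and introduced only for this illustration.

```python
import math

def project(points_3d, focal=1.0):
    """Pinhole projection of (x, y, z) points to 2D image coordinates (assumed camera model)."""
    return [(focal * x / z, focal * y / z) for x, y, z in points_3d]

def pairwise_spread(points_2d):
    """Mean pairwise Euclidean distance between the 2D points of one frame."""
    total, count = 0.0, 0
    for i in range(len(points_2d)):
        for j in range(i + 1, len(points_2d)):
            total += math.dist(points_2d[i], points_2d[j])
            count += 1
    return total / count

def relative_depth_from_tracks(tracks):
    """Depth of each frame relative to frame 0, from 2D track spacing alone.

    tracks[t][i] is the (x, y) position of tracked point i at frame t.
    Valid only under the stated assumption (rigid, fronto-parallel points),
    where spacing scales as 1/Z, so Z_t / Z_0 = spread_0 / spread_t.
    """
    reference = pairwise_spread(tracks[0])
    return [reference / pairwise_spread(frame) for frame in tracks]

# Toy scene: a 2x2 square of points placed at depths 2, 4, and 1 in successive frames.
square = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]
depths = [2.0, 4.0, 1.0]
tracks = [project([(x, y, z) for x, y in square]) for z in depths]

ratios = relative_depth_from_tracks(tracks)
print([round(r, 3) for r in ratios])  # [1.0, 2.0, 0.5]
```

The recovered ratios match the true depths divided by the first-frame depth, but only up to an unknown global scale, which is why the abstract speaks of relative depth. Real scenes violate the rigid, fronto-parallel assumption, which is the gap a learned model over many trajectories is meant to close.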

Code Implementations: 1 repo