CV · Mar 7, 2025

Stereo Any Video: Temporally Consistent Stereo Matching

arXiv:2503.05549v3 · 4 citations · h-index: 54
Originality: Highly original
AI Analysis

This work addresses the problem of producing stable 3D reconstructions from video for applications such as robotics and AR/VR. It represents a strong domain-specific advance.

The paper tackles video stereo matching by developing a framework that estimates accurate and temporally consistent disparities without needing camera poses or optical flow, achieving state-of-the-art performance in zero-shot settings across multiple datasets.

This paper introduces Stereo Any Video, a powerful framework for video stereo matching. It can estimate spatially accurate and temporally consistent disparities without relying on auxiliary information such as camera poses or optical flow. The strong capability is driven by rich priors from monocular video depth models, which are integrated with convolutional features to produce stable representations. To further enhance performance, key architectural innovations are introduced: all-to-all-pairs correlation, which constructs smooth and robust matching cost volumes, and temporal convex upsampling, which improves temporal coherence. These components collectively ensure robustness, accuracy, and temporal consistency, setting a new standard in video stereo matching. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple datasets both qualitatively and quantitatively in zero-shot settings, as well as strong generalization to real-world indoor and outdoor scenarios.
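For readers unfamiliar with the term, a matching cost volume stores a similarity score for every candidate left/right pixel pairing. The sketch below is only a rough illustration, assuming rectified images so that matches lie on the same row: it computes a plain per-row all-pairs correlation of the kind used in earlier stereo networks, then picks the best match per pixel. It is not the paper's all-to-all-pairs correlation, and the function names `correlation_volume` and `best_disparity` are hypothetical.

```python
import math


def correlation_volume(left, right):
    """Per-row all-pairs correlation for rectified stereo features.

    left, right: feature maps as nested lists with shape [H][W][C].
    Returns a nested list with shape [H][W][W], where
    volume[y][x1][x2] = dot(left[y][x1], right[y][x2]) / sqrt(C).
    """
    channels = len(left[0][0])
    scale = 1.0 / math.sqrt(channels)
    volume = []
    for row_l, row_r in zip(left, right):
        volume.append([
            [scale * sum(a * b for a, b in zip(f_l, f_r)) for f_r in row_r]
            for f_l in row_l
        ])
    return volume


def best_disparity(volume):
    """Winner-takes-all disparity d = x_left - x_right, restricted to d >= 0."""
    disparity = []
    for row in volume:
        disparity_row = []
        for x1, scores in enumerate(row):
            # Only right-image columns at or to the left of x1 are valid matches.
            best_x2 = max(range(x1 + 1), key=lambda x2: scores[x2])
            disparity_row.append(x1 - best_x2)
        disparity.append(disparity_row)
    return disparity


# Toy example: one row, four pixels, one-hot features; the left view is the
# right view shifted by one pixel, so the true disparity is 1 wherever a
# match exists.
right = [[[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]]
left = [[[0, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]]
volume = correlation_volume(left, right)
disparity = best_disparity(volume)
print(disparity)  # [[0, 1, 1, 1]]
```

A winner-takes-all readout like this is noisy on real features and has no notion of time, which is the gap that learned cost aggregation and temporally aware upsampling are meant to close.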

