CV · Nov 5, 2020

Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers

arXiv:2011.02910v4 · 387 citations
AI Analysis

This work addresses stereo depth estimation for computer vision applications. It offers a method that improves flexibility and generalization, though the contribution is incremental: it builds on existing transformer and stereo matching techniques.

The authors tackle stereo depth estimation by replacing cost volume construction with a sequence-to-sequence approach that uses transformers for dense pixel matching. The resulting method relaxes the fixed disparity range limit, identifies occlusions, and generalizes across domains without fine-tuning.

Stereo depth estimation relies on optimal correspondence matching between pixels on epipolar lines in the left and right images to infer depth. In this work, we revisit the problem from a sequence-to-sequence correspondence perspective to replace cost volume construction with dense pixel matching using position information and attention. This approach, named STereo TRansformer (STTR), has several advantages: It 1) relaxes the limitation of a fixed disparity range, 2) identifies occluded regions and provides confidence estimates, and 3) imposes uniqueness constraints during the matching process. We report promising results on both synthetic and real-world datasets and demonstrate that STTR generalizes across different domains, even without fine-tuning.
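To make the matching idea in the abstract concrete, here is a minimal sketch of attention-style matching along a single epipolar line. It is not the STTR implementation: the function name `match_scanline`, the `min_conf` threshold, and the hand-made feature vectors are all assumptions for illustration. STTR uses learned features and position information and imposes uniqueness constraints, all of which this sketch omits.

```python
import math


def match_scanline(left_feats, right_feats, min_conf=0.5):
    """Match each left pixel to a right pixel on the same epipolar line.

    left_feats, right_feats: lists of feature vectors, one per pixel.
    Returns one (disparity, confidence) pair per left pixel. The disparity
    is None when the best match is weaker than min_conf, which this sketch
    (hypothetically) treats as an occluded or unmatched pixel.
    """
    results = []
    for x_left, f_left in enumerate(left_feats):
        # Dot-product similarity against every pixel of the right scanline,
        # so no maximum disparity has to be fixed in advance.
        scores = [sum(a * b for a, b in zip(f_left, f_right))
                  for f_right in right_feats]
        # Numerically stable softmax turns scores into attention weights.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Take the most attended right pixel as the match.
        x_right = max(range(len(weights)), key=weights.__getitem__)
        conf = weights[x_right]
        if conf < min_conf:
            results.append((None, conf))
        else:
            results.append((x_left - x_right, conf))
    return results


def one_hot(index, size=5, scale=5.0):
    """Toy stand-in for a learned feature: a scaled one-hot vector."""
    vec = [0.0] * size
    vec[index] = scale
    return vec


# The right scanline sees the same content shifted by one pixel; the first
# left pixel has no counterpart in the right image.
left = [one_hot(0), one_hot(1), one_hot(2), one_hot(3)]
right = [one_hot(1), one_hot(2), one_hot(3), one_hot(4)]
disparities = [d for d, _ in match_scanline(left, right)]
print(disparities)  # [None, 1, 1, 1]
```

The attention weight of the chosen match doubles as a rough confidence estimate, which is how the sketch flags the unmatched first pixel.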

Code Implementations: 1 repo