CV · Oct 15, 2021

Attention meets Geometry: Geometry Guided Spatial-Temporal Attention for Consistent Self-Supervised Monocular Depth Estimation

arXiv:2110.08192v1 · 44 citations
Originality: Incremental advance
AI Analysis

This work addresses depth consistency issues in self-supervised monocular depth estimation, which is important for applications like autonomous driving and robotics, though it appears incremental as it builds on existing transformer-based pipelines.

The paper tackles the challenge of inferring geometrically consistent dense 3D scenes from consecutive monocular images in self-supervised depth estimation. Its method improves temporal depth stability and accuracy compared to previous methods.

Inferring geometrically consistent dense 3D scenes across a tuple of temporally consecutive images remains challenging for self-supervised monocular depth prediction pipelines. This paper explores how the increasingly popular transformer architecture, together with novel regularized loss formulations, can improve depth consistency while preserving accuracy. We propose a spatial attention module that correlates coarse depth predictions to aggregate local geometric information. A novel temporal attention mechanism further processes the local geometric information in a global context across consecutive images. Additionally, we introduce geometric constraints between frames regularized by photometric cycle consistency. By combining our proposed regularization and the novel spatial-temporal-attention module we fully leverage both the geometric and appearance-based consistency across monocular frames. This yields geometrically meaningful attention and improves temporal depth stability and accuracy compared to previous methods.
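The temporal attention idea in the abstract, where local information from each frame is processed in a global context across consecutive images, can be illustrated with a minimal sketch. This is not the authors' module: it assumes plain scaled dot-product self-attention with identity query/key/value projections (no learned weights), and the function name `temporal_attention` and the tensor layout `(T, N, C)` are hypothetical choices made only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(features):
    """Illustrative self-attention across frames (hypothetical helper).

    features: array of shape (T, N, C) with T consecutive frames,
              N spatial tokens per frame, and C channels per token.
    Every token attends to all tokens of all T frames, so each output
    token mixes information from the whole temporal window.
    Assumption: identity projections stand in for learned Q/K/V layers.
    """
    T, N, C = features.shape
    tokens = features.reshape(T * N, C)          # flatten frames into one token set
    scores = tokens @ tokens.T / np.sqrt(C)      # scaled dot-product similarity
    weights = softmax(scores, axis=-1)           # each row sums to 1
    fused = weights @ tokens                     # weighted aggregation over all frames
    return fused.reshape(T, N, C), weights

# Toy input: 3 frames, 4 tokens per frame, 8 channels (random stand-in features).
rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 4, 8))
fused, attn = temporal_attention(feats)
```

In a real pipeline the tokens would come from a depth network's feature maps and the projections would be learned; the sketch only shows how attention weights can span several frames at once.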
