CV May 17, 2023

CostFormer: Cost Transformer for Cost Aggregation in Multi-view Stereo

arXiv:2305.10320v1, 33 citations
Originality: Highly original
AI Analysis

This work addresses a domain-specific bottleneck in 3D reconstruction for computer vision applications, offering a plug-in improvement over existing methods.

The paper tackles the problem of repetitive or incorrect matches in multi-view stereo cost aggregation by introducing CostFormer, a Transformer-based network that overcomes computational complexity issues and achieves state-of-the-art results on DTU and Tanks and Temples benchmarks.

The core of Multi-view Stereo (MVS) is the matching process between reference and source pixels. Cost aggregation plays a significant role in this process, and previous methods handle it with CNNs. These methods may inherit the natural limitation of CNNs, which fail to discriminate repetitive or incorrect matches because of their limited local receptive fields. To handle the issue, we aim to introduce the Transformer into cost aggregation. However, another problem may arise from the quadratically growing computational complexity of the Transformer, which results in memory overflow and inference latency. In this paper, we overcome these limits with an efficient Transformer-based cost aggregation network, namely CostFormer. The Residual Depth-Aware Cost Transformer (RDACT) is proposed to aggregate long-range features on the cost volume via self-attention mechanisms along the depth and spatial dimensions. Furthermore, the Residual Regression Transformer (RRT) is proposed to enhance spatial attention. The proposed method is a universal plug-in to improve learning-based MVS methods.
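To make the idea of self-attention "along the depth and spatial dimensions" of a cost volume more concrete, here is a minimal NumPy sketch. It is not the authors' RDACT or RRT: the `(C, D, H, W)` layout, the identity query/key/value projections, the single head, and the function names are all assumptions chosen for illustration. The helper `attention_entries` only counts attention-matrix entries, to show why attending over all `D*H*W` positions jointly grows quadratically and why splitting attention by axis is cheaper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Plain single-head self-attention with identity projections (an
    # assumption made for brevity). tokens has shape (..., N, C); an
    # N x N attention matrix is built for every leading index.
    scale = 1.0 / np.sqrt(tokens.shape[-1])
    scores = (tokens @ np.swapaxes(tokens, -1, -2)) * scale
    return softmax(scores, axis=-1) @ tokens

def depth_attention(cost):
    # cost: (C, D, H, W). At each pixel, the D depth hypotheses attend
    # to one another; a residual connection keeps the input signal.
    t = cost.transpose(2, 3, 1, 0)            # (H, W, D, C)
    out = t + self_attention(t)
    return out.transpose(3, 2, 0, 1)          # back to (C, D, H, W)

def spatial_attention(cost):
    # At each depth hypothesis, all H*W pixels attend to one another.
    C, D, H, W = cost.shape
    t = cost.reshape(C, D, H * W).transpose(1, 2, 0)   # (D, H*W, C)
    out = t + self_attention(t)
    return out.transpose(2, 0, 1).reshape(C, D, H, W)

def attention_entries(D, H, W):
    # Number of attention-matrix entries: joint attention over all
    # D*H*W positions versus depth-wise plus spatial-wise attention.
    joint = (D * H * W) ** 2
    factorized = H * W * D ** 2 + D * (H * W) ** 2
    return joint, factorized

rng = np.random.default_rng(0)
cost = rng.standard_normal((8, 8, 4, 4))      # toy (C, D, H, W) volume
out = spatial_attention(depth_attention(cost))
print(out.shape)
print(attention_entries(8, 4, 4))
```

Even in this toy setting the factorized form needs far fewer attention entries than the joint form, and the gap widens as the resolution grows. How CostFormer actually keeps memory and latency under control is described in the paper itself, not in this sketch.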
