MVS2D: Efficient Multi-view Stereo via Attention-Driven 2D Convolutions
This work addresses slow inference in learning-based multi-view stereo for computer vision applications, offering a significant speedup at competitive accuracy. The contribution is incremental in that it augments existing single-view networks with an attention mechanism.
The paper tackles the computational inefficiency of deep learning-based multi-view stereo systems by introducing MVS2D, an algorithm that uses attention-driven 2D convolutions to integrate multi-view constraints. It runs at least 2x faster than notable counterparts while producing state-of-the-art depth estimates and 3D reconstructions on benchmarks including ScanNet, SUN3D, RGBD, and DTU.
Deep learning has had a significant impact on multi-view stereo systems. State-of-the-art approaches typically build a cost volume and then apply multiple 3D convolution operations to recover the input image's pixel-wise depth. While such end-to-end learning of plane-sweeping stereo improves accuracy on public benchmarks, it is typically very slow to compute. We present MVS2D, a highly efficient multi-view stereo algorithm that seamlessly integrates multi-view constraints into single-view networks via an attention mechanism. Since MVS2D builds only on 2D convolutions, it is at least $2\times$ faster than all the notable counterparts. Moreover, our algorithm produces precise depth estimations and 3D reconstructions, achieving state-of-the-art results on the challenging benchmarks ScanNet, SUN3D, and RGBD, as well as the classical DTU dataset. Our algorithm also outperforms all other algorithms in the setting of inexact camera poses. Our code is released at \url{https://github.com/zhenpeiyang/MVS2D}.
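To make the idea of replacing a cost volume with attention more concrete, the following is a minimal per-pixel sketch and not the released MVS2D implementation. It assumes that, for one reference pixel, source-view features have already been sampled at a set of depth hypotheses; the function name `attend_over_depths`, the scaled dot-product scoring, and the soft depth readout are all illustrative assumptions rather than details taken from the paper.

```python
import math


def attend_over_depths(ref_vec, src_vecs, depth_values):
    """Attention over depth hypotheses for a single reference pixel.

    ref_vec      : list of C floats, feature of the reference pixel.
    src_vecs     : D lists of C floats, source-view features sampled at the
                   D depth hypotheses (sampling itself is assumed done).
    depth_values : D floats, the depth of each hypothesis.

    Returns (weights, expected_depth). The expected-depth readout is an
    illustrative assumption, not the paper's output head.
    """
    scale = math.sqrt(len(ref_vec))
    # Scaled dot-product similarity between reference and each sample.
    scores = [
        sum(r * s for r, s in zip(ref_vec, src)) / scale for src in src_vecs
    ]
    # Numerically stable softmax over the depth hypotheses.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Soft readout: attention-weighted average of the hypothesised depths.
    expected_depth = sum(w * d for w, d in zip(weights, depth_values))
    return weights, expected_depth


if __name__ == "__main__":
    ref = [1.0, 0.0]
    samples = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # middle sample matches
    weights, depth = attend_over_depths(ref, samples, [1.0, 2.0, 3.0])
    print(weights, depth)
```

The point of the sketch is that every step is a per-pixel operation over a short list of hypotheses, so the result can be fed to ordinary 2D convolutions without ever materialising and filtering a 3D volume.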