CV Mar 22, 2021

Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction

Guanglei Yang, Hao Tang, Mingli Ding, Nicu Sebe, Elisa Ricci

arXiv:2103.12091v229.3249 citations Has Code

Originality Incremental advance

AI Analysis

This work addresses a key limitation of convolutional networks in tasks requiring continuous pixel-wise predictions: modeling long-range dependencies. Its approach improves accuracy, but the advance is incremental because it builds on existing transformer and CNN methods.

The paper tackles the problem of modeling long-range dependencies in pixel-wise prediction tasks such as monocular depth and surface normal estimation. It proposes TransDepth, a hybrid architecture that combines CNNs and transformers with a novel gated attention decoder, and reports state-of-the-art performance on three challenging datasets.

While convolutional neural networks have shown a tremendous impact on various computer vision tasks, they generally demonstrate limitations in explicitly modeling long-range dependencies due to the intrinsic locality of the convolution operation. Initially designed for natural language processing tasks, Transformers have emerged as alternative architectures with innate global self-attention mechanisms to capture long-range dependencies. In this paper, we propose TransDepth, an architecture that benefits from both convolutional neural networks and transformers. To avoid the network losing its ability to capture local-level details due to the adoption of transformers, we propose a novel decoder that employs attention mechanisms based on gates. Notably, this is the first paper that applies transformers to pixel-wise prediction problems involving continuous labels (i.e., monocular depth prediction and surface normal estimation). Extensive experiments demonstrate that the proposed TransDepth achieves state-of-the-art performance on three challenging datasets. Our code is available at: https://github.com/ygjwd12345/TransDepth.
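
The two ideas in the abstract, global self-attention over CNN features and a gate-based mechanism for keeping local detail, can be illustrated with a minimal NumPy sketch. This is not the TransDepth implementation. The single-head attention, the `gated_fusion` function, the weight names, and the tensor sizes are all assumptions chosen for illustration, and the random array only stands in for a CNN feature map. The authors' actual code is in the repository linked above.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over (N, D) tokens.

    Returns the attended tokens and the (N, N) attention weights. Each row
    of the weights spans all N positions, which is what lets one pixel draw
    on any other pixel, unlike a convolution's local window.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


def gated_fusion(local_feat, global_feat, w_gate):
    """Hypothetical gate (an assumption, not the paper's decoder).

    A sigmoid gate computed from both inputs decides, per element, how much
    of the local (CNN) feature to keep versus the global (attention) one.
    """
    stacked = np.concatenate([local_feat, global_feat], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(stacked @ w_gate)))
    return gate * local_feat + (1.0 - gate) * global_feat, gate


rng = np.random.default_rng(0)
H, W, D = 4, 6, 8                               # illustrative sizes only
feature_map = rng.standard_normal((H, W, D))    # stand-in for a CNN feature map
tokens = feature_map.reshape(H * W, D)          # flatten pixels into a token sequence

# Randomly initialised projection weights (hypothetical; a real model learns these).
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
w_gate = rng.standard_normal((2 * D, D)) / np.sqrt(2 * D)

global_feat, attn = self_attention(tokens, w_q, w_k, w_v)
fused, gate = gated_fusion(tokens, global_feat, w_gate)
fused_map = fused.reshape(H, W, D)              # back to a spatial layout
```

Flattening the feature map into `H * W` tokens means the attention matrix has one row and one column per pixel position, so its cost grows quadratically with resolution. That cost is one reason hybrid designs typically apply attention to a downsampled CNN feature map rather than to raw pixels.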
