Lightweight Monocular Depth Estimation with an Edge Guided Network
This work addresses the need for efficient depth estimation in robotic applications, offering a lightweight solution with competitive performance.
The paper tackles the problem of high computational complexity in monocular depth estimation by proposing a lightweight Edge Guided Depth Estimation Network (EGD-Net) that incorporates edge information and a transformer-based feature aggregation module, achieving state-of-the-art accuracy at about 96 fps on a GTX 1080 GPU.
Monocular depth estimation is an important task with many robotic applications. Existing methods improve depth estimation accuracy by training increasingly deeper and wider networks, but these networks suffer from high computational complexity. Recent studies have found that edge information is an important cue for convolutional neural networks (CNNs) when estimating depth. Inspired by these observations, we present a novel lightweight Edge Guided Depth Estimation Network (EGD-Net). In particular, we start with a lightweight encoder-decoder architecture and embed an edge guidance branch, which takes image gradients and multi-scale feature maps from the backbone as input and learns edge attention features. To aggregate the context information and edge attention features, we design a transformer-based feature aggregation module (TRFA), which captures the long-range dependencies between the context information and edge attention features through a cross-attention mechanism. We perform extensive experiments on the NYU Depth v2 dataset. Experimental results show that the proposed method runs at about 96 fps on an NVIDIA GTX 1080 GPU while achieving state-of-the-art accuracy.
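The two ingredients named in the abstract, image gradients as an edge cue and cross-attention between context and edge features, can be illustrated with a minimal sketch. This is not the EGD-Net implementation: the function names `gradient_magnitude` and `cross_attention` are hypothetical, the gradient uses plain finite differences, and the attention is single-head with no learned projections. Using the context tokens as queries and the edge tokens as keys and values is also an assumption, since the abstract does not state which role each plays.

```python
import math


def gradient_magnitude(img):
    """Per-pixel gradient magnitude of a 2D grayscale image (list of rows).

    Uses finite differences with clamped borders; an assumed stand-in for
    whatever gradient operator the edge guidance branch actually uses.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            out[y][x] = math.hypot(gx, gy)
    return out


def _matmul(a, b):
    # Plain matrix product of two lists of rows.
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def _softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(context, edge):
    """Single-head scaled dot-product cross-attention.

    context: N_c x d token list, used as queries (assumption).
    edge:    N_e x d token list, used as keys and values (assumption).
    Returns the fused tokens (context plus attended edge features)
    and the N_c x N_e attention weights.
    """
    d = len(context[0])
    edge_t = [list(col) for col in zip(*edge)]
    scores = _matmul(context, edge_t)
    weights = [_softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = _matmul(weights, edge)
    # Residual connection: keep the context and add the edge-aware update.
    fused = [[c + a for c, a in zip(c_row, a_row)]
             for c_row, a_row in zip(context, attended)]
    return fused, weights


# Toy usage: a vertical step edge, then two context tokens attending
# over three edge tokens.
step_image = [[0.0, 0.0, 1.0, 1.0] for _ in range(3)]
edge_map = gradient_magnitude(step_image)

context_tokens = [[1.0, 0.0], [0.0, 1.0]]
edge_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused_tokens, attn = cross_attention(context_tokens, edge_tokens)
```

Because every context token attends over all edge tokens, each output position can draw on edge evidence from anywhere in the image, which is the long-range dependency the abstract refers to. A practical module would add learned query, key, and value projections and operate on flattened feature maps rather than toy lists.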