CV · Aug 17, 2020

Self-Supervised Learning for Monocular Depth Estimation from Aerial Imagery

Max Hermann, Boitumelo Ruf, Martin Weinmann, Stefan Hinz

arXiv:2008.07246v1 · 15 citations

AI Analysis

The paper addresses the difficulty of acquiring ground-truth data for aerial depth estimation and offers a self-supervised solution for researchers and practitioners in remote sensing and computer vision. The contribution is incremental, since it builds on existing self-supervised techniques.

The paper tackles monocular depth estimation from aerial imagery without annotated data by learning depth and camera pose jointly from image sequences captured by a single moving camera, achieving a δ1.25 accuracy of up to 93.5%. Although the results are inferior to those of conventional multi-view methods, the authors conclude that the estimates are well suited as an initialization for image-matching methods and as a fallback in regions where matching fails, such as occluded or texture-less areas.

Supervised learning based methods for monocular depth estimation usually require large amounts of extensively annotated training data. In the case of aerial imagery, this ground truth is particularly difficult to acquire. Therefore, in this paper, we present a method for self-supervised learning for monocular depth estimation from aerial imagery that does not require annotated training data. For this, we only use an image sequence from a single moving camera and learn to simultaneously estimate depth and pose information. By sharing the weights between pose and depth estimation, we achieve a relatively small model, which favors real-time application. We evaluate our approach on three diverse datasets and compare the results to conventional methods that estimate depth maps based on multi-view geometry. We achieve an accuracy δ1.25 of up to 93.5 %. In addition, we have paid particular attention to the generalization of a trained model to unknown data and the self-improving capabilities of our approach. We conclude that, even though the results of monocular depth estimation are inferior to those achieved by conventional methods, they are well suited to provide a good initialization for methods that rely on image matching or to provide estimates in regions where image matching fails, e.g. occluded or texture-less regions.
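The abstract reports an accuracy δ1.25 without defining it. As a minimal sketch, assuming the definition commonly used in the monocular depth literature (the share of pixels whose ratio between predicted and reference depth, taken in whichever direction is larger, stays below 1.25), the metric could be computed as follows. The function name `delta_accuracy`, the flat-list input, and the rule of skipping non-positive depths as invalid pixels are illustrative assumptions, not details taken from the paper.

```python
def delta_accuracy(pred, ref, threshold=1.25):
    """Fraction of valid pixels with max(pred/ref, ref/pred) < threshold.

    pred, ref: equally long sequences of depth values (flattened depth maps).
    Pixels with a non-positive depth in either map are treated as invalid
    and skipped (an assumption made for this sketch).
    """
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same length")

    # Keep only pixels where both depths are usable.
    valid = [(p, r) for p, r in zip(pred, ref) if p > 0 and r > 0]
    if not valid:
        raise ValueError("no valid pixels to evaluate")

    # A pixel counts as accurate if the depth ratio is within the threshold
    # in both directions, i.e. neither over- nor under-estimated too much.
    hits = sum(1 for p, r in valid if max(p / r, r / p) < threshold)
    return hits / len(valid)


# Toy example: the last pixel has no reference depth and is ignored;
# of the remaining three, the ratios are 1.0, 1.2 and 2.0.
predicted = [10.0, 12.0, 20.0, 5.0]
reference = [10.0, 10.0, 10.0, 0.0]
score = delta_accuracy(predicted, reference)
```

In the toy example two of the three valid pixels fall under the threshold, so `score` is 2/3. For full-resolution depth maps one would typically vectorize the same computation with an array library rather than loop over Python lists.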

