Unsupervised High-Resolution Depth Learning From Videos With Dual Networks
This work improves monocular depth estimation for applications such as autonomous driving and robotics by making more efficient use of high-resolution data. The contribution is incremental, since it builds on existing unsupervised methods.
The paper addresses unsupervised depth learning from videos, where down-sampling high-resolution images to fit memory and computation constraints discards fine-grained details. It proposes a dual networks architecture that takes high-resolution inputs directly and produces high-accuracy depth maps, and it reports state-of-the-art results on the KITTI and Make3D benchmarks.
Unsupervised depth learning takes the appearance difference between a target view and a view synthesized from its adjacent frame as the supervisory signal. Since the supervisory signal comes only from the images themselves, the resolution of the training data significantly impacts performance. High-resolution images contain more fine-grained details and provide a more accurate supervisory signal. However, because of limited memory and computation power, the original images are typically down-sampled during training, which causes a heavy loss of detail and disparity accuracy. In order to fully exploit the information contained in high-resolution data, we propose a simple yet effective dual networks architecture, which can directly take high-resolution images as input and efficiently generate high-resolution, high-accuracy depth maps. We also propose a Self-assembled Attention (SA-Attention) module to handle low-texture regions. The evaluation on the benchmark KITTI and Make3D datasets demonstrates that our method achieves state-of-the-art results in the monocular depth estimation task.
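The supervisory signal described above, the appearance difference between a target view and a view synthesized from an adjacent frame, can be illustrated with a small sketch. This is not the paper's implementation. It assumes grayscale images, replaces the depth-and-pose warp with a purely horizontal per-pixel disparity shift, and uses a plain mean absolute difference as the appearance term. The names warp_horizontal and photometric_l1 are hypothetical and chosen for illustration only.

```python
import numpy as np


def warp_horizontal(source, disparity):
    """Synthesize a view by sampling `source` at column x + disparity.

    Assumption: a horizontal shift stands in for the full depth-and-pose
    warp. Sampling uses linear interpolation, and coordinates are clamped
    to the image border.
    """
    h, w = source.shape
    xs = np.arange(w, dtype=np.float64)[None, :] + disparity
    xs = np.clip(xs, 0, w - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    return (1.0 - frac) * source[rows, x0] + frac * source[rows, x1]


def photometric_l1(target, synthesized):
    """Mean absolute appearance difference (illustrative loss only)."""
    return float(np.mean(np.abs(target - synthesized)))


# Toy data: the target is the source shifted left by two pixels.
rng = np.random.default_rng(0)
source = rng.random((4, 16))
target = np.roll(source, -2, axis=1)

good = warp_horizontal(source, np.full(source.shape, 2.0))
bad = warp_horizontal(source, np.zeros(source.shape))

# Compare only the columns that are unaffected by border clamping.
loss_good = photometric_l1(target[:, :14], good[:, :14])
loss_bad = photometric_l1(target[:, :14], bad[:, :14])
print(loss_good, loss_bad)
```

In this toy setup the correct disparity reproduces the target away from the border, while a wrong disparity leaves a visible appearance difference. That gap is what the training signal relies on, and it is also why discarding image detail by down-sampling weakens the signal.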