Self-supervised Monocular Depth Estimation with Large Kernel Attention
This work addresses monocular depth estimation for computer vision applications. Its contribution is incremental: it builds on existing self-supervised methods and adds a new decoder design.
The paper tackles self-supervised monocular depth estimation. It targets a limitation of existing Transformer-based methods, which treat 2D features as 1D sequences and overlook channel features. To address this, it proposes a network whose decoder uses large kernel attention to model long-distance dependencies while preserving the 2D structure of features and channel adaptivity. It reports competitive results on the KITTI dataset.
Self-supervised monocular depth estimation has emerged as a promising approach since it does not rely on labeled training data. Most methods combine convolution and Transformers to model long-distance dependencies and estimate depth accurately. However, Transformers treat 2D image features as 1D sequences. Positional encoding only partially mitigates the resulting loss of spatial information between feature blocks, and Transformers tend to overlook channel features; both issues limit the performance of depth estimation. In this paper, we propose a self-supervised monocular depth estimation network that recovers finer details. Specifically, we propose a decoder based on large kernel attention, which can model long-distance dependencies without compromising the two-dimensional structure of features while maintaining feature channel adaptivity. In addition, we introduce an up-sampling module to accurately recover the fine details in the depth map. Our method achieves competitive results on the KITTI dataset.
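The abstract does not say how the large kernel attention in the decoder is built, so the following is only a minimal sketch of the general idea, not the paper's implementation. It assumes the decomposition used in the Visual Attention Network: a 5x5 depthwise convolution, a 7x7 depthwise convolution with dilation 3, and a 1x1 pointwise convolution, whose output gates the input element-wise. The function names, weight shapes, and kernel sizes are illustrative assumptions.

```python
import numpy as np


def depthwise_conv2d(x, kernels, dilation=1):
    """Depthwise 2D convolution with zero padding that keeps H and W.

    x:       (C, H, W) feature map
    kernels: (C, k, k) one odd-sized filter per channel
    """
    _, h, w = x.shape
    k = kernels.shape[-1]
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(k):
        for j in range(k):
            di, dj = i * dilation, j * dilation
            # Each channel is filtered independently; the 2D grid is never flattened.
            out += kernels[:, i, j, None, None] * xp[:, di:di + h, dj:dj + w]
    return out


def lka_attention_map(x, w_local, w_dilated, w_point):
    """Attention map from the assumed three-step large kernel decomposition."""
    attn = depthwise_conv2d(x, w_local)                   # 5x5: local context
    attn = depthwise_conv2d(attn, w_dilated, dilation=3)  # 7x7, dilation 3: long range
    # 1x1 convolution mixes channels, giving channel-wise adaptivity.
    return np.einsum("oc,chw->ohw", w_point, attn)


def large_kernel_attention(x, w_local, w_dilated, w_point):
    """Gate the input with the attention map (element-wise product)."""
    return lka_attention_map(x, w_local, w_dilated, w_point) * x


# Hypothetical usage with random weights on a small feature map.
rng = np.random.default_rng(0)
channels = 4
features = rng.standard_normal((channels, 8, 8))
out = large_kernel_attention(
    features,
    rng.standard_normal((channels, 5, 5)),
    rng.standard_normal((channels, 7, 7)),
    rng.standard_normal((channels, channels)),
)
print(out.shape)  # (4, 8, 8)
```

Under these assumed kernel sizes, each output position sees a 23x23 neighbourhood of the input while the features stay on their 2D grid, and the pointwise step lets the attention weights differ per channel. These are the two properties the abstract contrasts with sequence-based Transformer attention.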