CV · May 11

Efficient Hybrid CNN-GNN Architecture for Monocular Depth Estimation

arXiv:2605.10251 · 1.0
Predicted impact: top 99% in CV · last 90 days
Originality: Incremental advance
AI Analysis

For researchers in depth estimation, this work offers an efficient alternative to transformer-based models by using GNNs to capture long-range dependencies with linear complexity, though the gains are incremental over existing hybrid approaches.

GraphDepth integrates Graph Neural Networks into a convolutional encoder-decoder for monocular depth estimation, achieving competitive accuracy within 4.6% of transformer-based SOTA on indoor scenes while being faster (25 FPS vs 9 FPS) and more memory-efficient (3.8 GB vs 8.8 GB VRAM). It also sets a new state-of-the-art on WHU Aerial (RMSE 8.24 m) and shows strong zero-shot cross-domain transfer.

We present GraphDepth, a monocular depth estimation architecture that synergistically integrates Graph Neural Networks (GNNs) within a convolutional encoder-decoder framework. Our approach embeds efficient GraphSAGE layers at multiple scales of a ResNet-101 U-Net backbone, enabling explicit modeling of long-range spatial relationships that lie beyond the receptive field of local convolutions. Key technical contributions include: (1) batch-parallelized graph construction with configurable k-NN and grid-based adjacency for scalable training; (2) multi-scale GraphSAGE integration at bottleneck and decoder stages (1/32, 1/16, 1/8 resolution) to propagate global context throughout the feature hierarchy; (3) channel-attention gated skip connections that adaptively weight encoder features before fusion; and (4) heteroscedastic uncertainty estimation via a dedicated aleatoric uncertainty head, enabling confidence-aware loss weighting during optimization. Unlike transformer-based hybrids, which suffer from quadratic complexity in sequence length, GraphDepth scales linearly with spatial resolution while achieving comparable global receptive fields through iterative message passing. Experiments on NYU Depth V2, WHU Aerial, ETH3D, and Mid-Air benchmarks demonstrate competitive accuracy within 4.6% of state-of-the-art transformers on indoor scenes with substantially lower computational cost (25 FPS vs 9 FPS, 3.8 GB vs 8.8 GB VRAM). GraphDepth achieves the best reported result on WHU Aerial (RMSE 8.24 m) and exhibits superior zero-shot cross-domain transfer to the Mid-Air synthetic aerial dataset, validating the generalization power of explicit relational reasoning for depth estimation.
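To make the grid-based adjacency and the "iterative message passing" claim concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names `grid_adjacency` and `sage_mean_layer`, the 4-connected neighbourhood, and the mean aggregator with separate self and neighbour weights are all assumptions chosen for illustration (the abstract does not say which GraphSAGE aggregator or grid connectivity is used).

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def grid_adjacency(height: int, width: int) -> List[List[int]]:
    """Neighbour lists for a 4-connected grid over a height x width feature map.

    Nodes are indexed row-major: node = row * width + col.
    (4-connectivity is an assumption made for this sketch.)
    """
    neighbours: List[List[int]] = []
    for row in range(height):
        for col in range(width):
            nbrs = []
            for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                r, c = row + d_row, col + d_col
                if 0 <= r < height and 0 <= c < width:
                    nbrs.append(r * width + c)
            neighbours.append(nbrs)
    return neighbours


def matvec(weights: Matrix, vector: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def sage_mean_layer(features: List[Vector], neighbours: List[List[int]],
                    w_self: Matrix, w_neigh: Matrix) -> List[Vector]:
    """One GraphSAGE-style layer with mean aggregation and ReLU:

        h_i' = ReLU(W_self h_i + W_neigh * mean_{j in N(i)} h_j)
    """
    dim = len(features[0])
    out: List[Vector] = []
    for i, nbrs in enumerate(neighbours):
        if nbrs:
            mean = [sum(features[j][k] for j in nbrs) / len(nbrs)
                    for k in range(dim)]
        else:
            mean = [0.0] * dim  # isolated node: nothing to aggregate
        self_term = matvec(w_self, features[i])
        neigh_term = matvec(w_neigh, mean)
        out.append([max(0.0, a + b) for a, b in zip(self_term, neigh_term)])
    return out


# Toy demo: a 2 x 3 map with a single active node in the top-left corner.
height, width = 2, 3
neighbours = grid_adjacency(height, width)
features = [[1.0]] + [[0.0] for _ in range(height * width - 1)]
identity = [[1.0]]

hidden = features
for step in range(1, 4):
    hidden = sage_mean_layer(hidden, neighbours, identity, identity)
    # Value at the bottom-right node, 3 grid hops away from the source.
    print(step, hidden[-1][0])
```

In this toy run the bottom-right node stays at zero for the first two layers and becomes positive on the third, since it sits three hops from the source: each layer widens the receptive field by one hop. The grid has at most four directed edges per node, so the cost of one layer grows linearly with the number of nodes N, whereas dense self-attention compares all N² pairs.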
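The channel-attention gated skip connection in contribution (3) can be pictured as a squeeze-and-excitation style gate applied to encoder features before they are fused into the decoder. The sketch below is an assumption about the general shape of such a gate, not the paper's design: `gate_skip_features`, the single linear layer, and the sigmoid gating are hypothetical choices for illustration.

```python
import math
from typing import List

Channel = List[List[float]]  # one H x W feature map


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gate_skip_features(encoder_feats: List[Channel],
                       weights: List[List[float]],
                       bias: List[float]) -> List[Channel]:
    """Re-weight each encoder channel by a learned gate in (0, 1).

    encoder_feats: C channels, each an H x W grid.
    weights, bias: a hypothetical C x C linear layer acting on pooled channels.
    """
    # Squeeze: global average pool, one scalar per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
              for ch in encoder_feats]
    # Excite: linear layer followed by a sigmoid, one gate per channel.
    gates = [sigmoid(sum(w * p for w, p in zip(w_row, pooled)) + b)
             for w_row, b in zip(weights, bias)]
    # Scale: multiply every value of a channel by that channel's gate.
    return [[[g * v for v in row] for row in ch]
            for ch, g in zip(encoder_feats, gates)]


# Two channels of shape 1 x 2; a strongly negative bias closes the second gate.
feats = [[[2.0, 4.0]], [[6.0, 8.0]]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
gated = gate_skip_features(feats, zeros, [0.0, -20.0])
print(gated)
```

With zero weights the gate depends only on the bias, so the first channel is halved (sigmoid of 0 is 0.5) and the second is suppressed to nearly zero. In a trained network the weights would make each gate depend on the pooled content of the encoder features.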
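The abstract does not give the loss used with the aleatoric uncertainty head. A common formulation for heteroscedastic regression, assumed here purely for illustration, has the network predict a per-pixel log-variance s alongside the depth and minimises 0.5 * exp(-s) * (pred - target)² + 0.5 * s. The function name `heteroscedastic_l2_loss` and the squared-error base term are hypothetical; the paper may use a different base loss.

```python
import math
from typing import Sequence


def heteroscedastic_l2_loss(pred: Sequence[float],
                            log_var: Sequence[float],
                            target: Sequence[float]) -> float:
    """Mean over pixels of 0.5 * exp(-s) * (pred - target)^2 + 0.5 * s.

    log_var holds s = log(sigma^2), the predicted per-pixel log-variance.
    A large s down-weights the squared error but pays a 0.5 * s penalty.
    """
    if len(pred) == 0:
        raise ValueError("need at least one pixel")
    total = 0.0
    for d_hat, s, d in zip(pred, log_var, target):
        total += 0.5 * math.exp(-s) * (d_hat - d) ** 2 + 0.5 * s
    return total / len(pred)


# A pixel with a 3 m error: claiming high uncertainty lowers the loss.
confident = heteroscedastic_l2_loss([4.0], [0.0], [1.0])
uncertain = heteroscedastic_l2_loss([4.0], [2.0], [1.0])
print(confident, uncertain)

# A perfectly predicted pixel: claiming high uncertainty raises the loss.
print(heteroscedastic_l2_loss([1.0], [0.0], [1.0]),
      heteroscedastic_l2_loss([1.0], [2.0], [1.0]))
```

The two comparisons show the confidence-aware weighting at work: raising the predicted variance reduces the penalty on a badly estimated pixel, while the 0.5 * s term stops the network from declaring every pixel uncertain.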

