CV · Jun 8, 2022

Depth Estimation Matters Most: Improving Per-Object Depth Estimation for Monocular 3D Detection and Tracking

arXiv:2206.03666v1 · 26 citations · h-index: 32
Originality: Incremental advance
AI Analysis

This work addresses performance gaps in monocular 3D perception for autonomous driving applications, offering incremental improvements through enhanced depth estimation.

The paper addresses the performance gap between monocular and LiDAR-based 3D perception. It identifies per-object depth estimation as a key bottleneck and proposes a multi-level fusion method that achieves state-of-the-art per-object depth estimation on the Waymo Open and KITTI datasets, which in turn yields significant improvements in 3D detection and tracking.

Monocular image-based 3D perception has become an active research area in recent years owing to its applications in autonomous driving. However, monocular approaches to 3D perception, including detection and tracking, often perform worse than LiDAR-based techniques. Through systematic analysis, we identified per-object depth estimation accuracy as a major factor bounding performance. Motivated by this observation, we propose a multi-level fusion method that combines different representations (RGB and pseudo-LiDAR) and temporal information across multiple frames for objects (tracklets) to enhance per-object depth estimation. Our fusion method achieves state-of-the-art per-object depth estimation on the Waymo Open Dataset, the KITTI detection dataset, and the KITTI MOT dataset. We further demonstrate that simply replacing the estimated depth with fusion-enhanced depth yields significant improvements in monocular 3D perception tasks, including detection and tracking.
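The abstract names its depth cues without spelling out the mechanics. The following is a minimal Python sketch under stated assumptions, not the authors' implementation. `backproject` and `pseudo_lidar` implement the standard pinhole back-projection commonly used to turn a depth map into a pseudo-LiDAR point cloud. `fuse_tracklet_depth` is a hypothetical inverse-variance average standing in for the paper's temporal fusion; it assumes the per-frame estimates have already been motion-compensated to the current frame. `replace_box_depth` mirrors the "replace estimated depth with fusion-enhanced depth" step by re-lifting a box's projected 2D center, assuming that center is available from the detector. All names (`Intrinsics`, `Point3D`, and the functions) are illustrative.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class Intrinsics:
    """Pinhole camera intrinsics, all in pixels (hypothetical helper type)."""
    fx: float
    fy: float
    cx: float
    cy: float


def backproject(u: float, v: float, depth: float, k: Intrinsics) -> Point3D:
    """Lift pixel (u, v) with metric depth Z into the camera frame.

    Standard pinhole back-projection: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    """
    x = (u - k.cx) * depth / k.fx
    y = (v - k.cy) * depth / k.fy
    return (x, y, depth)


def pseudo_lidar(depth_map: Sequence[Sequence[float]], k: Intrinsics) -> List[Point3D]:
    """Turn a dense depth map (rows = v, columns = u) into a pseudo-LiDAR point list."""
    points: List[Point3D] = []
    for v, row in enumerate(depth_map):
        for u, z in enumerate(row):
            if z > 0:  # skip pixels with missing or invalid depth
                points.append(backproject(u, v, z, k))
    return points


def fuse_tracklet_depth(depths: Sequence[float], variances: Sequence[float]) -> float:
    """HYPOTHETICAL stand-in for temporal fusion over one tracklet.

    Inverse-variance weighted mean of per-frame depth estimates. Assumes the
    estimates were already aligned to the current frame (ego and object motion
    compensated); the paper's actual fusion is not reproduced here.
    """
    if not depths or len(depths) != len(variances):
        raise ValueError("need one variance per depth estimate")
    if any(var <= 0 for var in variances):
        raise ValueError("variances must be positive")
    weights = [1.0 / var for var in variances]
    return sum(w * d for w, d in zip(weights, depths)) / sum(weights)


def replace_box_depth(center_uv: Tuple[float, float], fused_depth: float,
                      k: Intrinsics) -> Point3D:
    """Re-lift a box's projected 2D center with the fused depth; other box
    attributes (size, heading) would be kept from the original detector."""
    return backproject(center_uv[0], center_uv[1], fused_depth, k)


if __name__ == "__main__":
    k = Intrinsics(fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
    fused = fuse_tracklet_depth([20.0, 23.0], [2.0, 1.0])
    print(fused)  # 22.0
    print(replace_box_depth((740.0, 360.0), fused, k))  # (2.2, 0.0, 22.0)
```

Inverse-variance weighting is used here only because it is the simplest principled way to combine noisy estimates: frames whose depth is more certain (smaller variance) dominate the fused value.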

