Generalizing Monocular 3D Object Detection
This thesis addresses generalization issues in monocular 3D object detection for applications such as autonomous driving and robotics. The contribution appears incremental, since it builds on existing methods with specific improvements.
This thesis tackles the challenge of generalizing monocular 3D object detection models across occlusions, datasets, object sizes, and camera parameters. It proposes GrooMeD-NMS for occlusion robustness, DEVIANT backbones for dataset generalization, and SeaBird for large object detection.
Monocular 3D object detection (Mono3D) is a fundamental computer vision task that estimates an object's class, 3D position, dimensions, and orientation from a single image. Its applications, including autonomous driving, augmented reality, and robotics, critically rely on accurate 3D environmental understanding. This thesis addresses the challenge of generalizing Mono3D models to diverse scenarios, including occlusions, datasets, object sizes, and camera parameters. To enhance occlusion robustness, we propose a mathematically differentiable non-maximum suppression (GrooMeD-NMS). To improve generalization to new datasets, we explore depth equivariant (DEVIANT) backbones. We address the issue of large object detection, demonstrating that it is not solely a data imbalance or receptive field problem but also a noise sensitivity issue. To mitigate this, we introduce a segmentation-based approach in bird's-eye view with dice loss (SeaBird). Finally, we mathematically analyze the extrapolation of Mono3D models to unseen camera heights and improve Mono3D generalization in such out-of-distribution settings.
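The dice loss named in the abstract scores a predicted segmentation mask by its overlap with the ground-truth mask. The following is a minimal sketch of the standard soft dice loss, not the thesis's implementation: the function name `dice_loss`, the flat-list inputs, and the smoothing constant `eps` are assumptions made for illustration.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft dice loss between predicted probabilities and a binary mask.

    pred and target are flat sequences of equal length (for example, a
    flattened bird's-eye-view grid). eps is a hypothetical smoothing
    constant that avoids division by zero when both masks are empty.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    # Overlap between prediction and ground truth.
    intersection = sum(p * t for p, t in zip(pred, target))
    # Total mass of both masks.
    total = sum(pred) + sum(target)
    # Dice coefficient is 2*|A∩B| / (|A| + |B|); the loss is one minus it.
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


if __name__ == "__main__":
    mask = [1, 1, 0, 0]
    print(dice_loss([1, 1, 0, 0], mask))  # perfect overlap, loss near 0
    print(dice_loss([0, 0, 1, 1], mask))  # no overlap, loss near 1
```

Because the loss depends on the ratio of overlap to total mask area rather than on per-cell errors, it is a region-level objective. In practice it would be applied to tensors in a deep learning framework rather than to Python lists.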