CV · AI · Aug 3, 2024

LAM3D: Leveraging Attention for Monocular 3D Object Detection

arXiv:2408.01739v1 · 2 citations · h-index: 21
Originality: Incremental advance
AI Analysis

This paper addresses 3D object detection from single images for autonomous driving. The contribution is incremental because it builds on existing transformer architectures.

The paper tackles monocular 3D object detection by proposing LAM3D, a framework that leverages self-attention in a Vision Transformer-based architecture, and it outperforms reference methods on the KITTI benchmark.

Since the introduction of the self-attention mechanism and the adoption of the Transformer architecture for Computer Vision tasks, Vision Transformer-based architectures have gained considerable popularity in the field and are used for tasks such as image classification, object detection, and image segmentation. However, how to efficiently leverage the attention mechanism in vision transformers for Monocular 3D Object Detection remains an open question. In this paper, we present LAM3D, a framework that Leverages the self-Attention mechanism for Monocular 3D object Detection. The proposed method is built upon a Pyramid Vision Transformer v2 (PVTv2) as the feature extraction backbone, combined with 2D/3D detection machinery. We evaluate the proposed method on the KITTI 3D Object Detection Benchmark, demonstrating its applicability in the autonomous driving domain and outperforming reference methods. Moreover, owing to its use of self-attention, LAM3D systematically outperforms the equivalent architecture that does not employ self-attention.
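To make the role of self-attention on an image feature map concrete, here is a minimal NumPy sketch of single-head attention in which keys and values are computed from a spatially pooled copy of the map, in the spirit of the spatial-reduction attention used by the Pyramid Vision Transformer family. It is an illustration only, not the LAM3D implementation: the function name `spatial_reduction_attention`, the average-pooling choice, the `reduction` parameter, and the random projection matrices are all assumptions introduced here.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_reduction_attention(feat, w_q, w_k, w_v, reduction=2):
    """Single-head self-attention over an (H, W, C) feature map.

    Queries come from every spatial position; keys and values come from
    an average-pooled copy of the map (hypothetical choice for this sketch),
    which shrinks the attention matrix by a factor of reduction**2.
    """
    h, w, c = feat.shape
    queries = feat.reshape(h * w, c) @ w_q            # (H*W, d)

    # Average-pool the map in non-overlapping reduction x reduction windows.
    hr, wr = h // reduction, w // reduction
    cropped = feat[: hr * reduction, : wr * reduction]
    pooled = cropped.reshape(hr, reduction, wr, reduction, c).mean(axis=(1, 3))
    reduced = pooled.reshape(hr * wr, c)              # (H*W / r^2, C)

    keys = reduced @ w_k                              # (H*W / r^2, d)
    values = reduced @ w_v                            # (H*W / r^2, d_v)

    # Scaled dot-product attention: each position attends to pooled regions.
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores, axis=-1)
    out = weights @ values                            # (H*W, d_v)
    return out.reshape(h, w, -1), weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat = rng.normal(size=(8, 12, 16))               # toy feature map
    w_q, w_k, w_v = (rng.normal(size=(16, 16)) for _ in range(3))
    out, weights = spatial_reduction_attention(feat, w_q, w_k, w_v)
    print(out.shape, weights.shape)
```

Each output position is a weighted mix of features from the whole image, which is the global context a purely convolutional backbone of the same depth would not provide directly. Pooling the keys and values is what keeps the cost manageable on high-resolution driving images.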

