CV
Aug 15, 2023

UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation

Peking U
arXiv:2308.07732v1
108 citations
h-index: 137
Has Code
Originality: Highly original
AI Analysis

UniTR addresses the need for accurate and robust perception in autonomous driving systems by improving on both the efficiency and the performance of modality-specific methods.

The paper tackles the problem of inefficient multi-modal processing in 3D perception for autonomous driving by proposing UniTR, a unified transformer backbone that achieves state-of-the-art results, including +1.1 NDS for 3D object detection and +12.0 mIoU for BEV map segmentation on the nuScenes benchmark.

Jointly processing information from multiple sensors is crucial to achieving accurate and robust perception for reliable autonomous driving systems. However, current 3D perception research follows a modality-specific paradigm, which leads to additional computational overhead and inefficient collaboration between different sensor data. In this paper, we present an efficient multi-modal backbone for outdoor 3D perception named UniTR, which processes a variety of modalities with unified modeling and shared parameters. Unlike previous works, UniTR introduces a modality-agnostic transformer encoder to handle these view-discrepant sensor data, enabling parallel modal-wise representation learning and automatic cross-modal interaction without additional fusion steps. More importantly, to make full use of these complementary sensor types, we present a novel multi-modal integration strategy that considers both the semantic-abundant 2D perspective and geometry-aware 3D sparse neighborhood relations. UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks. It sets a new state of the art on the nuScenes benchmark, improving NDS by 1.1 for 3D object detection and mIoU by 12.0 for BEV map segmentation, with lower inference latency. Code will be available at https://github.com/Haiyang-W/UniTR.
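The idea of "unified modeling and shared parameters" in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `SharedAttention` and `encode`, the token shapes, and the single-head attention are all assumptions made for illustration. For brevity the sketch reuses one block for both stages and attends over all tokens at once, whereas the abstract describes interaction guided by 2D perspective and 3D sparse neighborhood relations.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


class SharedAttention:
    """Hypothetical single-head self-attention block with one set of weights."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)

        def init():
            return [[rng.gauss(0.0, dim ** -0.5) for _ in range(dim)] for _ in range(dim)]

        self.dim = dim
        self.wq, self.wk, self.wv = init(), init(), init()

    def __call__(self, tokens):
        q = matmul(tokens, self.wq)
        k = matmul(tokens, self.wk)
        v = matmul(tokens, self.wv)
        k_t = [list(col) for col in zip(*k)]
        scale = math.sqrt(self.dim)
        attn = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
        mixed = matmul(attn, v)
        # Residual connection: add the attended values back onto the input tokens.
        return [[t + m for t, m in zip(t_row, m_row)] for t_row, m_row in zip(tokens, mixed)]


def encode(block, camera_tokens, lidar_tokens):
    """Toy two-stage encoding with the same weights used for every modality."""
    # Stage 1 (intra-modal): each modality is processed separately, same parameters.
    cam = block(camera_tokens)
    lid = block(lidar_tokens)
    # Stage 2 (cross-modal): attention over the union of tokens, no separate fusion module.
    return block(cam + lid)


rng = random.Random(1)
camera_tokens = [[rng.random() for _ in range(4)] for _ in range(3)]  # 3 image tokens
lidar_tokens = [[rng.random() for _ in range(4)] for _ in range(5)]   # 5 voxel tokens
block = SharedAttention(dim=4)
fused = encode(block, camera_tokens, lidar_tokens)
print(len(fused), len(fused[0]))  # 8 4
```

The point of the sketch is that camera and LiDAR tokens pass through the same weights, so no modality-specific encoder or dedicated fusion layer is required once both are expressed as token sequences of the same width.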

Code Implementations: 3 repos
