CV · Mar 31, 2022

BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

arXiv:2203.17270v2 · 1908 citations · Has Code
Originality: Highly original
AI Analysis

This work addresses the problem of accurate 3D perception in autonomous driving systems, offering a significant performance improvement over existing camera-based methods.

The paper tackles 3D visual perception for autonomous driving by proposing BEVFormer, a framework that learns bird's-eye-view (BEV) representations from multi-camera images using spatiotemporal transformers. It reports a state-of-the-art NDS of 56.9% on the nuScenes test set, 9.0 points higher than the previous best camera-based methods and on par with LiDAR-based baselines.
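For readers unfamiliar with the metric, NDS (nuScenes Detection Score) combines mean average precision with five true-positive error terms, one of which is the velocity error. The sketch below follows the publicly documented nuScenes definition; the function name `nds` and the input values in the usage note are illustrative assumptions, not figures from the paper.

```python
def nds(mean_ap, tp_errors):
    """Compute the nuScenes Detection Score.

    mean_ap   : mAP in [0, 1].
    tp_errors : the five mean true-positive errors
                (mATE, mASE, mAOE, mAVE, mAAE), each >= 0.
    """
    if len(tp_errors) != 5:
        raise ValueError("expected exactly five true-positive error terms")
    # Each error is clipped to 1 and turned into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    # mAP carries weight 5; each of the five TP scores carries weight 1.
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0
```

With hypothetical inputs, `nds(0.5, [0.5] * 5)` evaluates to 0.5. Because the velocity error mAVE is one of the five terms, better velocity estimation raises NDS even when mAP is unchanged.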

3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design a spatial cross-attention in which each BEV query extracts spatial features from its regions of interest across camera views. For temporal information, we propose a temporal self-attention that recurrently fuses the history BEV information. Our approach achieves a new state-of-the-art 56.9% in terms of the NDS metric on the nuScenes test set, which is 9.0 points higher than the previous best methods and on par with the performance of LiDAR-based baselines. We further show that BEVFormer remarkably improves the accuracy of velocity estimation and the recall of objects under low-visibility conditions. The code is available at https://github.com/zhiqi-li/BEVFormer.
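The query mechanism described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the learned deformable attention is replaced by nearest-pixel sampling, the temporal self-attention by a fixed-weight blend, and the camera is a simplified pinhole model at the ego origin. All names here (`bev_query_centers`, `project`, `spatial_lookup`, `fuse_with_history`, and the camera dictionary keys) are hypothetical.

```python
import math

def bev_query_centers(grid_h, grid_w, cell_m):
    """Ego-frame (x, y) centers of a grid_h x grid_w BEV grid around the ego vehicle."""
    return [((col - grid_w / 2 + 0.5) * cell_m, (row - grid_h / 2 + 0.5) * cell_m)
            for row in range(grid_h) for col in range(grid_w)]

def project(point, cam):
    """Project an ego-frame 3D point into a toy pinhole camera; None if not visible."""
    x, y, z = point
    depth = x * math.cos(cam["yaw"]) + y * math.sin(cam["yaw"])
    if depth <= 1e-3:
        return None  # point is behind (or beside) the camera
    lateral = -x * math.sin(cam["yaw"]) + y * math.cos(cam["yaw"])
    u = cam["width"] / 2 - cam["focal"] * lateral / depth
    v = cam["height"] / 2 - cam["focal"] * (z - cam["mount_z"]) / depth
    if 0 <= u < cam["width"] and 0 <= v < cam["height"]:
        return int(u), int(v)
    return None  # projects outside the image

def spatial_lookup(center, pillar_heights, cams, feature_maps):
    """Average features over the views that a BEV query's pillar points project into."""
    per_view = []
    for cam, fmap in zip(cams, feature_maps):
        samples = []
        for z in pillar_heights:  # lift the 2D query to a pillar of 3D points
            pixel = project((center[0], center[1], z), cam)
            if pixel is not None:
                u, v = pixel
                samples.append(fmap[v][u])  # nearest-pixel stand-in for attention
        if samples:
            per_view.append(sum(samples) / len(samples))
    if not per_view:
        return None  # no camera sees this query
    return sum(per_view) / len(per_view)

def fuse_with_history(current, previous, history_weight=0.5):
    """Fixed-weight blend standing in for the learned temporal self-attention."""
    if previous is None:
        return current  # first frame: no history to fuse
    return (1 - history_weight) * current + history_weight * previous
```

The point of the sketch is the data flow: each grid cell only touches the camera views its 3D reference points land in, and the previous BEV state is folded into the current one frame by frame. In the actual model both steps are learned attention operations over feature vectors rather than fixed averages over scalars.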

Code Implementations: 3 repos