CV, AI, LG | Feb 15, 2023

Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction

Tsinghua
arXiv:2302.07817v2 | 511 citations | h-index: 97 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for accurate 3D scene understanding in autonomous driving with a camera-only approach that reaches performance comparable to LiDAR-based methods on the nuScenes LiDAR segmentation task. The advance over existing BEV methods is significant but incremental.

The paper tackles fine-grained 3D semantic occupancy prediction for vision-centric autonomous driving by proposing a tri-perspective view (TPV) representation, achieving performance comparable to LiDAR-based methods on nuScenes using only camera inputs.

Modern methods for vision-centric autonomous driving perception widely adopt the bird's-eye-view (BEV) representation to describe a 3D scene. Despite its better efficiency than voxel representation, it has difficulty describing the fine-grained 3D structure of a scene with a single plane. To address this, we propose a tri-perspective view (TPV) representation which accompanies BEV with two additional perpendicular planes. We model each point in the 3D space by summing its projected features on the three planes. To lift image features to the 3D TPV space, we further propose a transformer-based TPV encoder (TPVFormer) to obtain the TPV features effectively. We employ the attention mechanism to aggregate the image features corresponding to each query in each TPV plane. Experiments show that our model trained with sparse supervision effectively predicts the semantic occupancy for all voxels. We demonstrate for the first time that using only camera inputs can achieve comparable performance with LiDAR-based methods on the LiDAR segmentation task on nuScenes. Code: https://github.com/wzzheng/TPVFormer.
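The abstract's central idea, modeling each 3D point by summing its projected features on three perpendicular planes, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the plane names (`plane_hw`, `plane_dh`, `plane_wd`), the index order, and the use of integer cell lookup in place of any interpolation are hypothetical and are not taken from the TPVFormer code. The transformer encoder that would produce the plane features is omitted.

```python
# Illustrative sketch only: plane names, shapes, and nearest-cell lookup are
# assumptions for this example, not the TPVFormer implementation.

def make_plane(rows, cols, channels, value):
    """Return a rows x cols grid whose cells are feature vectors filled with `value`."""
    return [[[value] * channels for _ in range(cols)] for _ in range(rows)]


def tpv_point_feature(plane_hw, plane_dh, plane_wd, h, w, d):
    """Sum the features that voxel (h, w, d) projects to on the three planes."""
    f_top = plane_hw[h][w]    # top (BEV) plane, indexed by (h, w)
    f_side = plane_dh[d][h]   # side plane, indexed by (d, h)
    f_front = plane_wd[w][d]  # front plane, indexed by (w, d)
    return [a + b + c for a, b, c in zip(f_top, f_side, f_front)]


def tpv_to_voxels(plane_hw, plane_dh, plane_wd):
    """Expand the three planes into a dense H x W x D grid of feature vectors."""
    H, W, D = len(plane_hw), len(plane_hw[0]), len(plane_dh)
    return [
        [
            [tpv_point_feature(plane_hw, plane_dh, plane_wd, h, w, d) for d in range(D)]
            for w in range(W)
        ]
        for h in range(H)
    ]


# Toy sizes chosen only to keep the example readable.
H, W, D, C = 4, 3, 2, 2
plane_hw = make_plane(H, W, C, 1.0)
plane_dh = make_plane(D, H, C, 10.0)
plane_wd = make_plane(W, D, C, 100.0)
plane_hw[1][2] = [5.0, 6.0]  # make one BEV cell distinct

voxels = tpv_to_voxels(plane_hw, plane_dh, plane_wd)
print(len(voxels), len(voxels[0]), len(voxels[0][0]))  # grid dimensions H, W, D
print(voxels[1][2][0], voxels[1][2][1])
```

In this sketch, every voxel in a vertical column shares the same top-plane feature, and the side and front planes supply the variation along the remaining axis. The three planes store HW + DH + WD feature vectors rather than the HWD a dense voxel grid would need, which is where the efficiency argument comes from at realistic grid sizes.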

Code Implementations: 3 repos
