CV · May 3

SimPB++: Simultaneously Detecting 2D and 3D Objects from Multiple Cameras

arXiv:2605.01924 · 12.4
Predicted impact: top 67% in CV · last 90 days
Originality: Incremental advance
AI Analysis

For multi-camera autonomous driving, this work provides a unified end-to-end solution for simultaneous 2D and 3D detection, reducing reliance on expensive 3D labels.

SimPB++ proposes an end-to-end model that simultaneously detects 2D objects in perspective view and 3D objects in Bird's Eye View from multiple cameras, achieving state-of-the-art performance on nuScenes for both tasks and strong long-range detection up to 150m on Argoverse2.

Simultaneous perception of 2D objects in perspective view and 3D objects in Bird's Eye View (BEV) is challenging for multi-camera autonomous driving. Existing two-stage pipelines use 2D results only as a one-time cue for 3D detection. We propose SimPB++, which simultaneously detects 2D objects in perspective view and 3D objects in BEV from multiple cameras. It unifies both tasks into an end-to-end model with a hybrid decoder architecture, coupling multi-view 2D and 3D decoders interactively. Two novel modules enable deep interaction: Dynamic Query Allocation adaptively assigns 2D queries to 3D candidates, and Adaptive Query Aggregation refines 3D representations using multi-view 2D features, forming a cyclic 3D-2D-3D refinement. For multi-view 2D detection, we use Query-group Attention for intra-group communication. We also design a Crop-and-Scale strategy for long-range perception and a Propagating Denoising strategy with an auxiliary RoI detector. SimPB++ supports mixed supervision with 2D-only and fully annotated data, reducing reliance on expensive 3D labels. Experiments show state-of-the-art performance on nuScenes for both tasks and strong long-range detection (up to 150 m) on Argoverse2.
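The mixed-supervision idea in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the `Sample` class, the `mixed_supervision_loss` function, and the loss weights are hypothetical names chosen for illustration, and the per-sample losses are assumed to have been computed elsewhere. It shows only the general pattern of applying the 2D loss to every sample while restricting the 3D loss to samples that carry 3D labels.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """One training sample; all fields are hypothetical placeholders."""
    loss_2d: float                   # 2D detection loss, always available
    loss_3d: Optional[float] = None  # 3D detection loss; None for 2D-only samples


def mixed_supervision_loss(batch, weight_2d=1.0, weight_3d=1.0):
    """Average the 2D loss over all samples and the 3D loss over 3D-labelled samples only."""
    if not batch:
        raise ValueError("batch must not be empty")
    mean_2d = sum(s.loss_2d for s in batch) / len(batch)
    # Samples without 3D annotations contribute nothing to the 3D term.
    losses_3d = [s.loss_3d for s in batch if s.loss_3d is not None]
    mean_3d = sum(losses_3d) / len(losses_3d) if losses_3d else 0.0
    return weight_2d * mean_2d + weight_3d * mean_3d


batch = [
    Sample(loss_2d=0.5, loss_3d=1.5),  # fully annotated sample
    Sample(loss_2d=1.5),               # 2D-only sample
]
total = mixed_supervision_loss(batch)
```

Averaging the 3D term only over labelled samples keeps its scale independent of how many 2D-only samples are in the batch. Whether SimPB++ normalizes this way is not stated in the abstract.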
