ALOcc: Adaptive Lifting-Based 3D Semantic Occupancy and Cost Volume-Based Flow Predictions
This work addresses spatiotemporal scene understanding for autonomous systems, presenting three targeted improvements to existing vision-based occupancy and flow prediction frameworks.
The paper tackles 3D semantic occupancy and flow prediction by proposing a vision-based framework with three improvements: an occlusion-aware adaptive lifting mechanism with depth denoising, 3D-2D semantic consistency enforcement via jointly optimized prototypes, and a BEV-centric cost volume for joint occupancy and flow prediction. The method achieves new state-of-the-art performance on multiple benchmarks and offers a real-time version that exceeds existing real-time methods in both speed and accuracy.
3D semantic occupancy and flow prediction are fundamental to spatiotemporal scene understanding. This paper proposes a vision-based framework with three targeted improvements. First, we introduce an occlusion-aware adaptive lifting mechanism that incorporates depth denoising. This makes the 2D-to-3D feature transformation more robust and reduces its reliance on depth priors. Second, we enforce 3D-2D semantic consistency via jointly optimized prototypes, using confidence- and category-aware sampling to address the long-tailed class distribution. Third, to streamline joint prediction, we devise a BEV-centric cost volume that explicitly correlates semantic and flow features, supervised by a hybrid classification-regression scheme that handles diverse motion scales. Our purely convolutional architecture establishes new state-of-the-art (SOTA) performance on multiple benchmarks for both semantic occupancy prediction and joint semantic occupancy and flow prediction. We also present a family of models offering a spectrum of efficiency-performance trade-offs. Our real-time version exceeds all existing real-time methods in both speed and accuracy, which supports its practical viability.
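To make the cost-volume idea concrete, the following is a minimal NumPy sketch of a generic BEV correlation volume with a classification-then-expectation flow decoder. It is an illustration under stated assumptions, not the architecture described in the paper: the function names `bev_cost_volume` and `decode_flow`, the search radius `max_disp`, the cosine-similarity correlation, and the softmax `temperature` are all hypothetical choices made here. The soft-argmax decoder is only a simple stand-in for a hybrid classification-regression scheme, whose actual form is not specified in the abstract.

```python
import numpy as np


def bev_cost_volume(feat_cur, feat_prev, max_disp=2):
    """Correlate current BEV features with shifted previous-frame features.

    feat_cur, feat_prev: arrays of shape (C, H, W).
    Returns (cost, offsets): cost has shape (D, H, W) with
    D = (2 * max_disp + 1) ** 2, and offsets has shape (D, 2) holding the
    (dy, dx) candidate displacement for each cost channel.
    """
    _, H, W = feat_cur.shape

    def normalise(f):
        # L2-normalise along channels so each correlation is a cosine similarity.
        return f / (np.linalg.norm(f, axis=0, keepdims=True) + 1e-6)

    cur, prev = normalise(feat_cur), normalise(feat_prev)
    # Zero-pad the previous frame so every shifted window stays in bounds.
    pad = max_disp
    padded = np.pad(prev, ((0, 0), (pad, pad), (pad, pad)))

    costs, offsets = [], []
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            # window[:, y, x] equals prev[:, y + dy, x + dx] (zero if out of range).
            window = padded[:, pad + dy:pad + dy + H, pad + dx:pad + dx + W]
            costs.append((cur * window).sum(axis=0))
            offsets.append((dy, dx))
    return np.stack(costs), np.array(offsets, dtype=np.float32)


def decode_flow(cost, offsets, temperature=0.05):
    """Turn a cost volume into per-cell displacement estimates.

    Classification step: softmax over the D displacement bins.
    Regression step: expected displacement under that distribution.
    Returns (prob, flow) with shapes (D, H, W) and (2, H, W).
    """
    logits = cost / temperature
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)
    # Weighted sum of candidate offsets: (D, H, W) x (D, 2) -> (2, H, W).
    flow = np.einsum("dhw,dk->khw", prob, offsets)
    return prob, flow
```

In this sketch each offset points from a cell in the current frame to its best match in the previous frame, so a feature map that moved one cell in +x between frames decodes to a displacement of roughly (0, -1). Treating displacement as a distribution over discrete bins and then taking its expectation is one common way to cover both small and large motions with a single head; whether the paper's supervision follows this pattern is an assumption.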