CV · Oct 14, 2024

ET-Former: Efficient Triplane Deformable Attention for 3D Semantic Scene Completion From Monocular Camera

arXiv:2410.11019v2 · 3 citations · h-index: 16 · Has Code · IROS
Originality: Incremental advance
AI Analysis

This work addresses the problem of generating accurate 3D semantic occupancy maps for autonomous navigation, though it appears incremental as it builds on existing methods with specific enhancements.

The paper tackles 3D semantic scene completion from a single monocular camera, achieving state-of-the-art results with an IoU improvement from 44.71 to 51.49 and mIoU from 15.04 to 16.30 on the Semantic-KITTI dataset, while using only 10.9 GB of GPU memory.

We introduce ET-Former, a novel end-to-end algorithm for semantic scene completion using a single monocular camera. Our approach generates a semantic occupancy map from a single RGB observation while simultaneously providing uncertainty estimates for semantic predictions. By designing a triplane-based deformable attention mechanism, our approach improves geometric understanding of the scene compared with other SOTA approaches and reduces noise in semantic predictions. Additionally, through the use of a Conditional Variational AutoEncoder (CVAE), we estimate the uncertainties of these predictions. The generated semantic and uncertainty maps will help formulate navigation strategies that facilitate safe and permissible decision making in the future. Evaluated on the SemanticKITTI dataset, ET-Former achieves the highest Intersection over Union (IoU) and mean IoU (mIoU) scores while maintaining the lowest GPU memory usage, surpassing state-of-the-art (SOTA) methods. It improves the SOTA scores of IoU from 44.71 to 51.49 and mIoU from 15.04 to 16.30 on the SemanticKITTI test set, with a notably low training memory consumption of 10.9 GB. Project page: https://github.com/jingGM/ET-Former.git.
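The two reported metrics can be illustrated with a minimal sketch. It assumes the usual semantic scene completion convention: IoU scores binary occupancy (occupied vs. empty, ignoring class), while mIoU averages per-class IoU over the non-empty semantic classes. The function names, the `EMPTY = 0` label, and the toy voxel labels are hypothetical and not taken from the ET-Former code; real benchmark evaluation typically also masks out unlabeled voxels and accumulates counts over the whole test set.

```python
from typing import Sequence

EMPTY = 0  # assumed label for free space (hypothetical convention)


def completion_iou(pred: Sequence[int], gt: Sequence[int]) -> float:
    """Binary occupancy IoU: a voxel is 'occupied' if its label is not EMPTY."""
    inter = sum(1 for p, g in zip(pred, gt) if p != EMPTY and g != EMPTY)
    union = sum(1 for p, g in zip(pred, gt) if p != EMPTY or g != EMPTY)
    return inter / union if union else 0.0


def mean_iou(pred: Sequence[int], gt: Sequence[int], num_classes: int) -> float:
    """Mean of per-class IoU over semantic classes 1..num_classes-1.

    Classes absent from both prediction and ground truth are skipped.
    """
    ious = []
    for c in range(1, num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six voxels flattened into a list, two semantic classes (1, 2).
gt_labels = [0, 1, 1, 2, 2, 0]
pred_labels = [0, 1, 2, 2, 0, 1]

iou = completion_iou(pred_labels, gt_labels)
miou = mean_iou(pred_labels, gt_labels, num_classes=3)
print(f"IoU={iou:.3f}  mIoU={miou:.3f}")
```

In this toy case the occupancy IoU is higher than the mIoU because a voxel predicted with the wrong class still counts as correctly occupied. The same effect helps explain why the reported IoU (51.49) sits well above the mIoU (16.30).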

