CVNov 20, 2022
Context-Aware Data Augmentation for LIDAR 3D Object DetectionXuzhong Hu, Zaipeng Duan, Jie Ma
For 3D object detection, labeling lidar point cloud is difficult, so data augmentation is an important module to make full use of precious annotated data. As a widely used data augmentation method, GT-sample effectively improves detection performance by inserting groundtruths into the lidar frame during training. However, these samples are often placed in unreasonable areas, which misleads model to learn the wrong context information between targets and backgrounds. To address this problem, in this paper, we propose a context-aware data augmentation method (CA-aug) , which ensures the reasonable placement of inserted objects by calculating the "Validspace" of the lidar point cloud. CA-aug is lightweight and compatible with other augmentation methods. Compared with the GT-sample and the similar method in Lidar-aug(SOTA), it brings higher accuracy to the existing detectors. We also present an in-depth study of augmentation methods for the range-view-based(RV-based) models and find that CA-aug can fully exploit the potential of RV-based networks. The experiment on KITTI val split shows that CA-aug can improve the mAP of the test model by 8%.
CVJul 22, 2025Code
SDGOCC: Semantic and Depth-Guided Bird's-Eye View Transformation for 3D Multimodal Occupancy PredictionZaipeng Duan, Chenxu Dang, Xuzhong Hu et al.
Multimodal 3D occupancy prediction has garnered significant attention for its potential in autonomous driving. However, most existing approaches are single-modality: camera-based methods lack depth information, while LiDAR-based methods struggle with occlusions. Current lightweight methods primarily rely on the Lift-Splat-Shoot (LSS) pipeline, which suffers from inaccurate depth estimation and fails to fully exploit the geometric and semantic information of 3D LiDAR points. Therefore, we propose a novel multimodal occupancy prediction network called SDG-OCC, which incorporates a joint semantic and depth-guided view transformation coupled with a fusion-to-occupancy-driven active distillation. The enhanced view transformation constructs accurate depth distributions by integrating pixel semantics and co-point depth through diffusion and bilinear discretization. The fusion-to-occupancy-driven active distillation extracts rich semantic information from multimodal data and selectively transfers knowledge to image features based on LiDAR-identified regions. Finally, for optimal performance, we introduce SDG-Fusion, which uses fusion alone, and SDG-KL, which integrates both fusion and distillation for faster inference. Our method achieves state-of-the-art (SOTA) performance with real-time processing on the Occ3D-nuScenes dataset and shows comparable performance on the more challenging SurroundOcc-nuScenes dataset, demonstrating its effectiveness and robustness. The code will be released at https://github.com/DzpLab/SDGOCC.
CVFeb 28, 2025Code
FASTer: Focal Token Acquiring-and-Scaling Transformer for Long-term 3D Object DetectionChenxu Dang, Zaipeng Duan, Pei An et al.
Recent top-performing temporal 3D detectors based on Lidars have increasingly adopted region-based paradigms. They first generate coarse proposals, followed by encoding and fusing regional features. However, indiscriminate sampling and fusion often overlook the varying contributions of individual points and lead to exponentially increased complexity as the number of input frames grows. Moreover, arbitrary result-level concatenation limits the global information extraction. In this paper, we propose a Focal Token Acquring-and-Scaling Transformer (FASTer), which dynamically selects focal tokens and condenses token sequences in an adaptive and lightweight manner. Emphasizing the contribution of individual tokens, we propose a simple but effective Adaptive Scaling mechanism to capture geometric contexts while sifting out focal points. Adaptively storing and processing only focal points in historical frames dramatically reduces the overall complexity. Furthermore, a novel Grouped Hierarchical Fusion strategy is proposed, progressively performing sequence scaling and Intra-Group Fusion operations to facilitate the exchange of global spatial and temporal information. Experiments on the Waymo Open Dataset demonstrate that our FASTer significantly outperforms other state-of-the-art detectors in both performance and efficiency while also exhibiting improved flexibility and robustness. The code is available at https://github.com/MSunDYY/FASTer.git.
CVApr 2, 2025
Multimodal Point Cloud Semantic Segmentation With Virtual Point EnhancementZaipeng Duan, Xuzhong Hu, Pei An et al.
LiDAR-based 3D point cloud recognition has been proven beneficial in various applications. However, the sparsity and varying density pose a significant challenge in capturing intricate details of objects, particularly for medium-range and small targets. Therefore, we propose a multi-modal point cloud semantic segmentation method based on Virtual Point Enhancement (VPE), which integrates virtual points generated from images to address these issues. These virtual points are dense but noisy, and directly incorporating them can increase computational burden and degrade performance. Therefore, we introduce a spatial difference-driven adaptive filtering module that selectively extracts valuable pseudo points from these virtual points based on density and distance, enhancing the density of medium-range targets. Subsequently, we propose a noise-robust sparse feature encoder that incorporates noise-robust feature extraction and fine-grained feature enhancement. Noise-robust feature extraction exploits the 2D image space to reduce the impact of noisy points, while fine-grained feature enhancement boosts sparse geometric features through inner-voxel neighborhood point aggregation and downsampled voxel aggregation. The results on the SemanticKITTI and nuScenes, two large-scale benchmark data sets, have validated effectiveness, significantly improving 2.89\% mIoU with the introduction of 7.7\% virtual points on nuScenes.
CVMar 12, 2025
Dual-Domain Homogeneous Fusion with Cross-Modal Mamba and Progressive Decoder for 3D Object DetectionXuzhong Hu, Zaipeng Duan, Pei An et al.
Fusing LiDAR and image features in a homogeneous BEV domain has become popular for 3D object detection in autonomous driving. However, this paradigm is constrained by the excessive feature compression. While some works explore dense voxel fusion to enable better feature interaction, they face high computational costs and challenges in query generation. Additionally, feature misalignment in both domains results in suboptimal detection accuracy. To address these limitations, we propose a Dual-Domain Homogeneous Fusion network (DDHFusion), which leverages the complementarily of both BEV and voxel domains while mitigating their drawbacks. Specifically, we first transform image features into BEV and sparse voxel representations using lift-splat-shot and our proposed Semantic-Aware Feature Sampling (SAFS) module. The latter significantly reduces computational overhead by discarding unimportant voxels. Next, we introduce Homogeneous Voxel and BEV Fusion (HVF and HBF) networks for multi-modal fusion within respective domains. They are equipped with novel cross-modal Mamba blocks to resolve feature misalignment and enable comprehensive scene perception. The output voxel features are injected into the BEV space to compensate for the information loss brought by direct height compression. During query selection, the Progressive Query Generation (PQG) mechanism is implemented in the BEV domain to reduce false negatives caused by feature compression. Furthermore, we propose a Progressive Decoder (QD) that sequentially aggregates not only context-rich BEV features but also geometry-aware voxel features with deformable attention and the Multi-Modal Voxel Feature Mixing (MMVFM) block for precise classification and box regression.