CV · Nov 11, 2025

Multi-Modal Assistance for Unsupervised Domain Adaptation on Point Cloud 3D Object Detection

Shenao Zhao, Pengpeng Liang, Zhoufan Yang

arXiv:2511.07966v1 · 3.6 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses domain adaptation for 3D object detection, a key challenge in autonomous driving, by leveraging multi-modal data to improve robustness across different environments, though it is incremental as it builds on existing teacher-student and pseudo-label frameworks.

The paper tackles the problem of unsupervised domain adaptation for LiDAR-based 3D object detection by proposing MMAssist, which uses multi-modal assistance from images and text to align 3D features between domains, achieving promising performance compared to state-of-the-art methods on three datasets.

Unsupervised domain adaptation for LiDAR-based 3D object detection (3D UDA) based on the teacher-student architecture with pseudo labels has achieved notable improvements in recent years. Although point clouds and images are often collected simultaneously, little attention has been paid to the usefulness of image data when training 3D UDA models. In this paper, we propose an approach named MMAssist that improves the performance of 3D UDA with multi-modal assistance. We design a method that aligns 3D features between the source domain and the target domain by using image and text features as bridges. More specifically, we project the ground-truth labels or pseudo labels onto the images to get a set of 2D bounding boxes. For each 2D box, we extract its image feature from a pre-trained vision backbone. A large vision-language model (LVLM) is adopted to extract the box's text description, and a pre-trained text encoder is used to obtain its text feature. During the training of the model in the source domain and the student model in the target domain, we align the 3D features of the predicted boxes with their corresponding image and text features, and the 3D features and the aligned features are fused with learned weights for the final prediction. The features between the student branch and the teacher branch in the target domain are aligned as well. To enhance the pseudo labels, we use an off-the-shelf 2D object detector to generate 2D bounding boxes from images and estimate their corresponding 3D boxes with the aid of the point cloud, and these 3D boxes are combined with the pseudo labels generated by the teacher model. Experimental results show that our approach achieves promising performance compared with state-of-the-art methods in three domain adaptation tasks on three popular 3D object detection datasets. The code is available at https://github.com/liangp/MMAssist.
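
The label-projection step in the abstract (mapping 3D ground-truth or pseudo-label boxes to 2D image boxes) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the 3D box is already expressed in the camera frame (x right, y down, z forward), that yaw is a rotation about the camera y axis, and that the camera is an undistorted pinhole with intrinsics fx, fy, u0, v0. The function names and the box parameterization are hypothetical.

```python
import math


def box3d_corners(center, size, yaw):
    """Return the 8 corners of a 3D box in the camera frame.

    Assumed (hypothetical) parameterization: center = (x, y, z),
    size = (length, height, width), yaw = rotation about the y axis.
    """
    cx, cy, cz = center
    length, height, width = size
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (-length / 2, length / 2):
        for dy in (-height / 2, height / 2):
            for dz in (-width / 2, width / 2):
                # Rotate the offset about y, then translate to the box center.
                x = cx + c * dx + s * dz
                y = cy + dy
                z = cz - s * dx + c * dz
                corners.append((x, y, z))
    return corners


def project_box_to_image(center, size, yaw, fx, fy, u0, v0, img_w, img_h):
    """Project a 3D box to an axis-aligned 2D box (x1, y1, x2, y2).

    Returns None if any corner lies behind (or too close to) the camera,
    or if the clipped box falls outside the image.
    """
    corners = box3d_corners(center, size, yaw)
    if any(z <= 0.1 for _, _, z in corners):
        return None
    # Pinhole projection: u = fx * x / z + u0, v = fy * y / z + v0.
    us = [fx * x / z + u0 for x, _, z in corners]
    vs = [fy * y / z + v0 for _, y, z in corners]
    # Take the extent of the projected corners and clip it to the image.
    x1, x2 = max(0.0, min(us)), min(float(img_w), max(us))
    y1, y2 = max(0.0, min(vs)), min(float(img_h), max(vs))
    if x2 <= x1 or y2 <= y1:
        return None
    return (x1, y1, x2, y2)


if __name__ == "__main__":
    # A 2 m cube centered 10 m in front of the camera.
    print(project_box_to_image((0.0, 0.0, 10.0), (2.0, 2.0, 2.0), 0.0,
                               100.0, 100.0, 50.0, 50.0, 100, 100))
```

In a real pipeline the box would first be transformed from the LiDAR frame to the camera frame using the dataset's calibration, and each resulting 2D box would then be cropped and passed to the image and text feature extractors described above.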
