CVApr 23, 2023
Informative Data Selection with Uncertainty for Multi-modal Object DetectionXinyu Zhang, Zhiwei Li, Zhenhong Zou et al. · tsinghua
Noise has always been nonnegligible trouble in object detection by creating confusion in model reasoning, thereby reducing the informativeness of the data. It can lead to inaccurate recognition due to the shift in the observed pattern, that requires a robust generalization of the models. To implement a general vision model, we need to develop deep learning models that can adaptively select valid information from multi-modal data. This is mainly based on two reasons. Multi-modal learning can break through the inherent defects of single-modal data, and adaptive information selection can reduce chaos in multi-modal data. To tackle this problem, we propose a universal uncertainty-aware multi-modal fusion model. It adopts a multi-pipeline loosely coupled architecture to combine the features and results from point clouds and images. To quantify the correlation in multi-modal information, we model the uncertainty, as the inverse of data information, in different modalities and embed it in the bounding box generation. In this way, our model reduces the randomness in fusion and generates reliable output. Moreover, we conducted a completed investigation on the KITTI 2D object detection dataset and its derived dirty data. Our fusion model is proven to resist severe noise interference like Gaussian, motion blur, and frost, with only slight degradation. The experiment results demonstrate the benefits of our adaptive fusion. Our analysis on the robustness of multi-modal fusion will provide further insights for future research.
CVAug 24, 2023
SkipcrossNets: Adaptive Skip-cross Fusion for Road DetectionYan Gong, Xinyu Zhang, Hao Liu et al.
Multi-modal fusion is increasingly being used for autonomous driving tasks, as different modalities provide unique information for feature extraction. However, the existing two-stream networks are only fused at a specific network layer, which requires a lot of manual attempts to set up. As the CNN goes deeper, the two modal features become more and more advanced and abstract, and the fusion occurs at the feature level with a large gap, which can easily hurt the performance. To reduce the loss of height and depth information during the process of projecting point clouds into 2D space, we utilize calibration parameters to project the point cloud into Altitude Difference Images (ADIs), which exhibit more distinct road features. In this study, we propose a novel fusion architecture called Skip-cross Networks (SkipcrossNets), which combine adaptively ADIs and camera images without being bound to a certain fusion epoch. Specifically, skip-cross fusion strategy connects each layer to each layer in a feed-forward manner, and for each layer, the feature maps of all previous layers are used as input and its own feature maps are used as input to all subsequent layers for the other modality, enhancing feature propagation and multi-modal features fusion. This strategy facilitates selection of the most similar feature layers from two modalities, enhancing feature reuse and providing complementary effects for sparse point cloud features. The advantages of skip-cross fusion strategy is demonstrated through application to the KITTI and A2D2 datasets, achieving a MaxF score of 96.85% on KITTI and an F1 score of 84.84% on A2D2. The model parameters require only 2.33 MB of memory at a speed of 68.24 FPS, which can be viable for mobile terminals and embedded devices.