Beinan Yu

CV
h-index7
7papers
180citations
Novelty59%
AI Score48

7 Papers

CVFeb 28, 2023Code
BEVPlace: Learning LiDAR-based Place Recognition using Bird's Eye View Images

Lun Luo, Shuhang Zheng, Yixuan Li et al.

Place recognition is a key module for long-term SLAM systems. Current LiDAR-based place recognition methods usually use representations of point clouds such as unordered points or range images. These methods achieve high recall rates of retrieval, but their performance may degrade in the case of view variation or scene changes. In this work, we explore the potential of a different representation in place recognition, i.e. bird's eye view (BEV) images. We observe that the structural contents of BEV images are less influenced by rotations and translations of point clouds. We validate that, without any delicate design, a simple VGGNet trained on BEV images achieves comparable performance with the state-of-the-art place recognition methods in scenes of slight viewpoint changes. For more robust place recognition, we design a rotation-invariant network called BEVPlace. We use group convolution to extract rotation-equivariant local features from the images and NetVLAD for global feature aggregation. In addition, we observe that the distance between BEV features is correlated with the geometry distance of point clouds. Based on the observation, we develop a method to estimate the position of the query cloud, extending the usage of place recognition. The experiments conducted on large-scale public datasets show that our method 1) achieves state-of-the-art performance in terms of recall rates, 2) is robust to view changes, 3) shows strong generalization ability, and 4) can estimate the positions of query point clouds. Source codes are publicly available at https://github.com/zjuluolun/BEVPlace.

CVJul 11, 2024Code
SCPNet: Unsupervised Cross-modal Homography Estimation via Intra-modal Self-supervised Learning

Runmin Zhang, Jun Ma, Si-Yuan Cao et al.

We propose a novel unsupervised cross-modal homography estimation framework based on intra-modal Self-supervised learning, Correlation, and consistent feature map Projection, namely SCPNet. The concept of intra-modal self-supervised learning is first presented to facilitate the unsupervised cross-modal homography estimation. The correlation-based homography estimation network and the consistent feature map projection are combined to form the learnable architecture of SCPNet, boosting the unsupervised learning framework. SCPNet is the first to achieve effective unsupervised homography estimation on the satellite-map image pair cross-modal dataset, GoogleMap, under [-32,+32] offset on a 128x128 image, leading the supervised approach MHN by 14.0% of mean average corner error (MACE). We further conduct extensive experiments on several cross-modal/spectral and manually-made inconsistent datasets, on which SCPNet achieves the state-of-the-art (SOTA) performance among unsupervised approaches, and owns 49.0%, 25.2%, 36.4%, and 10.7% lower MACEs than the supervised approach MHN. Source code is available at https://github.com/RM-Zhang/SCPNet.

CVMar 2, 2023
I2P-Rec: Recognizing Images on Large-scale Point Cloud Maps through Bird's Eye View Projections

Shuhang Zheng, Yixuan Li, Zhu Yu et al.

Place recognition is an important technique for autonomous cars to achieve full autonomy since it can provide an initial guess to online localization algorithms. Although current methods based on images or point clouds have achieved satisfactory performance, localizing the images on a large-scale point cloud map remains a fairly unexplored problem. This cross-modal matching task is challenging due to the difficulty in extracting consistent descriptors from images and point clouds. In this paper, we propose the I2P-Rec method to solve the problem by transforming the cross-modal data into the same modality. Specifically, we leverage on the recent success of depth estimation networks to recover point clouds from images. We then project the point clouds into Bird's Eye View (BEV) images. Using the BEV image as an intermediate representation, we extract global features with a Convolutional Neural Network followed by a NetVLAD layer to perform matching. The experimental results evaluated on the KITTI dataset show that, with only a small set of training data, I2P-Rec achieves recall rates at Top-1\% over 80\% and 90\%, when localizing monocular and stereo images on point cloud maps, respectively. We further evaluate I2P-Rec on a 1 km trajectory dataset collected by an autonomous logistics car and show that I2P-Rec can generalize well to previously unseen environments.

CVFeb 24Code
Boosting Instance Awareness via Cross-View Correlation with 4D Radar and Camera for 3D Object Detection

Xiaokai Bai, Lianqing Zheng, Si-Yuan Cao et al.

4D millimeter-wave radar has emerged as a promising sensing modality for autonomous driving due to its robustness and affordability. However, its sparse and weak geometric cues make reliable instance activation difficult, limiting the effectiveness of existing radar-camera fusion paradigms. BEV-level fusion offers global scene understanding but suffers from weak instance focus, while perspective-level fusion captures instance details but lacks holistic context. To address these limitations, we propose SIFormer, a scene-instance aware transformer for 3D object detection using 4D radar and camera. SIFormer first suppresses background noise during view transformation through segmentation- and depth-guided localization. It then introduces a cross-view activation mechanism that injects 2D instance cues into BEV space, enabling reliable instance awareness under weak radar geometry. Finally, a transformer-based fusion module aggregates complementary image semantics and radar geometry for robust perception. As a result, with the aim of enhancing instance awareness, SIFormer bridges the gap between the two paradigms, combining their complementary strengths to address inherent sparse nature of radar and improve detection accuracy. Experiments demonstrate that SIFormer achieves state-of-the-art performance on View-of-Delft, TJ4DRadSet and NuScenes datasets. Source code is available at github.com/shawnnnkb/SIFormer.

CVSep 11, 2025
S-BEVLoc: BEV-based Self-supervised Framework for Large-scale LiDAR Global Localization

Chenghao Zhang, Lun Luo, Si-Yuan Cao et al.

LiDAR-based global localization is an essential component of simultaneous localization and mapping (SLAM), which helps loop closure and re-localization. Current approaches rely on ground-truth poses obtained from GPS or SLAM odometry to supervise network training. Despite the great success of these supervised approaches, substantial cost and effort are required for high-precision ground-truth pose acquisition. In this work, we propose S-BEVLoc, a novel self-supervised framework based on bird's-eye view (BEV) for LiDAR global localization, which eliminates the need for ground-truth poses and is highly scalable. We construct training triplets from single BEV images by leveraging the known geographic distances between keypoint-centered BEV patches. Convolutional neural network (CNN) is used to extract local features, and NetVLAD is employed to aggregate global descriptors. Moreover, we introduce SoftCos loss to enhance learning from the generated triplets. Experimental results on the large-scale KITTI and NCLT datasets show that S-BEVLoc achieves state-of-the-art performance in place recognition, loop closure, and global localization tasks, while offering scalability that would require extra effort for supervised approaches.

CVMay 2, 2025
CDFormer: Cross-Domain Few-Shot Object Detection Transformer Against Feature Confusion

Boyuan Meng, Xiaohan Zhang, Peilin Li et al.

Cross-domain few-shot object detection (CD-FSOD) aims to detect novel objects across different domains with limited class instances. Feature confusion, including object-background confusion and object-object confusion, presents significant challenges in both cross-domain and few-shot settings. In this work, we introduce CDFormer, a cross-domain few-shot object detection transformer against feature confusion, to address these challenges. The method specifically tackles feature confusion through two key modules: object-background distinguishing (OBD) and object-object distinguishing (OOD). The OBD module leverages a learnable background token to differentiate between objects and background, while the OOD module enhances the distinction between objects of different classes. Experimental results demonstrate that CDFormer outperforms previous state-of-the-art approaches, achieving 12.9% mAP, 11.0% mAP, and 10.4% mAP improvements under the 1/5/10 shot settings, respectively, when fine-tuned.

CVJun 9, 2021
PCNet: A Structure Similarity Enhancement Method for Multispectral and Multimodal Image Registration

Si-Yuan Cao, Beinan Yu, Lun Luo et al.

Multispectral and multimodal images are of important usage in the field of multi-source visual information fusion. Due to the alternation or movement of image devices, the acquired multispectral and multimodal images are usually misaligned, and hence image registration is pre-requisite. Different from the registration of common images, the registration of multispectral or multimodal images is a challenging problem due to the nonlinear variation of intensity and gradient. To cope with this challenge, we propose the phase congruency network (PCNet) to enhance the structure similarity of multispectral or multimodal images. The images can then be aligned using the similarity-enhanced feature maps produced by the network. PCNet is constructed under the inspiration of the well-known phase congruency. The network embeds the phase congruency prior into two simple trainable layers and series of modified learnable Gabor kernels. Thanks to the prior knowledge, once trained, PCNet is applicable on a variety of multispectral and multimodal data such as flash/no-flash and RGB/NIR images without additional further tuning. The prior also makes the network lightweight. The trainable parameters of PCNet are 2400 times less than the deep-learning registration method DHN, while its registration performance surpasses DHN. Experimental results validate that PCNet outperforms current state-of-the-art conventional multimodal registration algorithms. Besides, PCNet can act as a complementary part of the deep-learning registration methods, which significantly boosts their registration accuracy. The percentage of the number of images under 1 pixel average corner error (ACE) of UDHN is raised from 0.2% to 89.9% after the processing of PCNet.