Xiongfei Li

CV
h-index4
5papers
26citations
Novelty54%
AI Score36

5 Papers

CVJun 11, 2024Code
A Semantic-Aware and Multi-Guided Network for Infrared-Visible Image Fusion

Xiaoli Zhang, Liying Wang, Libo Zhao et al.

Multi-modality image fusion aims at fusing modality-specific (complementarity) and modality-shared (correlation) information from multiple source images. To tackle the problem of the neglect of inter-feature relationships, high-frequency information loss, and the limited attention to downstream tasks, this paper focuses on how to model correlation-driven decomposing features and reason high-level graph representation by efficiently extracting complementary information and aggregating multi-guided features. We propose a three-branch encoder-decoder architecture along with corresponding fusion layers as the fusion strategy. Firstly, shallow features from individual modalities are extracted by a depthwise convolution layer combined with the transformer block. In the three parallel branches of the encoder, Cross Attention and Invertible Block (CAI) extracts local features and preserves high-frequency texture details. Base Feature Extraction Module (BFE) captures long-range dependencies and enhances modality-shared information. Graph Reasoning Module (GR) is introduced to reason high-level cross-modality relations and simultaneously extract low-level detail features as CAI's modality-specific complementary information. Experiments demonstrate the competitive results compared with state-of-the-art methods in visible/infrared image fusion and medical image fusion tasks. Moreover, the proposed algorithm surpasses the state-of-the-art methods in terms of subsequent tasks, averagely scoring 8.27% mAP@0.5 higher in object detection and 5.85% mIoU higher in semantic segmentation. The code is avaliable at https://github.com/Abraham-Einstein/SMFNet/.

CVJul 25, 2025
Cross Spatial Temporal Fusion Attention for Remote Sensing Object Detection via Image Feature Matching

Abu Sadat Mohammad Salehin Amit, Xiaoli Zhang, Md Masum Billa Shagar et al.

Effectively describing features for cross-modal remote sensing image matching remains a challenging task due to the significant geometric and radiometric differences between multimodal images. Existing methods primarily extract features at the fully connected layer but often fail to capture cross-modal similarities effectively. We propose a Cross Spatial Temporal Fusion (CSTF) mechanism that enhances feature representation by integrating scale-invariant keypoints detected independently in both reference and query images. Our approach improves feature matching in two ways: First, by creating correspondence maps that leverage information from multiple image regions simultaneously, and second, by reformulating the similarity matching process as a classification task using SoftMax and Fully Convolutional Network (FCN) layers. This dual approach enables CSTF to maintain sensitivity to distinctive local features while incorporating broader contextual information, resulting in robust matching across diverse remote sensing modalities. To demonstrate the practical utility of improved feature matching, we evaluate CSTF on object detection tasks using the HRSC2016 and DOTA benchmark datasets. Our method achieves state-of-theart performance with an average mAP of 90.99% on HRSC2016 and 90.86% on DOTA, outperforming existing models. The CSTF model maintains computational efficiency with an inference speed of 12.5 FPS. These results validate that our approach to crossmodal feature matching directly enhances downstream remote sensing applications such as object detection.

CVOct 8, 2021
Bounding-box deep calibration for high performance face detection

Shi Luo, Xiongfei Li, Xiaoli Zhang

Modern convolutional neural networks (CNNs)-based face detectors have achieved tremendous strides due to large annotated datasets. However, misaligned results with high detection confidence but low localization accuracy restrict the further improvement of detection performance. In this paper, the authors first predict high confidence detection results on the training set itself. Surprisingly, a considerable part of them exist in the same misalignment problem. Then, the authors carefully examine these cases and point out that annotation misalignment is the main reason. Later, a comprehensive discussion is given for the replacement rationality between predicted and annotated bounding-boxes. Finally, the authors propose a novel Bounding-Box Deep Calibration (BDC) method to reasonably replace misaligned annotations with model predicted bounding-boxes and offer calibrated annotations for the training set. Extensive experiments on multiple detectors and two popular benchmark datasets show the effectiveness of BDC on improving models' precision and recall rate, without adding extra inference time and memory consumption. Our simple and effective method provides a general strategy for improving face detection, especially for light-weight detectors in real-time situations.

CVMar 10, 2021
Wide Aspect Ratio Matching for Robust Face Detection

Shi Luo, Xiongfei Li, Xiaoli Zhang

Recently, anchor-based methods have achieved great progress in face detection. Once anchor design and anchor matching strategy determined, plenty of positive anchors will be sampled. However, faces with extreme aspect ratio always fail to be sampled according to standard anchor matching strategy. In fact, the max IoUs between anchors and extreme aspect ratio faces are still lower than fixed sampling threshold. In this paper, we firstly explore the factors that affect the max IoU of each face in theory. Then, anchor matching simulation is performed to evaluate the sampling range of face aspect ratio. Besides, we propose a Wide Aspect Ratio Matching (WARM) strategy to collect more representative positive anchors from ground-truth faces across a wide range of aspect ratio. Finally, we present a novel feature enhancement module, named Receptive Field Diversity (RFD) module, to provide diverse receptive field corresponding to different aspect ratios. Extensive experiments show that our method can help detectors better capture extreme aspect ratio faces and achieve promising detection performance on challenging face detection benchmarks, including WIDER FACE and FDDB datasets.

CVDec 20, 2018
SFA: Small Faces Attention Face Detector

Shi Luo, Xiongfei Li, Rui Zhu et al.

In recent year, tremendous strides have been made in face detection thanks to deep learning. However, most published face detectors deteriorate dramatically as the faces become smaller. In this paper, we present the Small Faces Attention (SFA) face detector to better detect faces with small scale. First, we propose a new scale-invariant face detection architecture which pays more attention to small faces, including 4-branch detection architecture and small faces sensitive anchor design. Second, feature maps fusion strategy is applied in SFA by partially combining high-level features into low-level features to further improve the ability of finding hard faces. Third, we use multi-scale training and testing strategy to enhance face detection performance in practice. Comprehensive experiments show that SFA significantly improves face detection performance, especially on small faces. Our real-time SFA face detector can run at 5 FPS on a single GPU as well as maintain high performance. Besides, our final SFA face detector achieves state-of-the-art detection performance on challenging face detection benchmarks, including WIDER FACE and FDDB datasets, with competitive runtime speed. Both our code and models will be available to the research community.