Fei Xue

CV
h-index: 21
Papers: 16
Citations: 516
Novelty: 56%
AI Score: 45

16 Papers

CV · Apr 28, 2023
SFD2: Semantic-guided Feature Detection and Description

Fei Xue, Ignas Budvytis, Roberto Cipolla

Visual localization is a fundamental task for various applications including autonomous driving and robotics. Prior methods focus on extracting large amounts of often redundant locally reliable features, resulting in limited efficiency and accuracy, especially in large-scale environments under challenging conditions. Instead, we propose to extract globally reliable features by implicitly embedding high-level semantics into both the detection and description processes. Specifically, our semantic-aware detector is able to detect keypoints from reliable regions (e.g. building, traffic lane) and suppress unreliable areas (e.g. sky, car) implicitly instead of relying on explicit semantic labels. This boosts the accuracy of keypoint matching by reducing the number of features sensitive to appearance changes and avoiding the need for additional segmentation networks at test time. Moreover, our descriptors are augmented with semantics and have stronger discriminative ability, providing more inliers at test time. In particular, experiments on the long-term large-scale visual localization datasets Aachen Day-Night and RobotCar-Seasons demonstrate that our model outperforms previous local features and achieves accuracy competitive with advanced matchers while being about 2 and 3 times faster when using 2k and 4k keypoints, respectively.

CV · Apr 28, 2023
IMP: Iterative Matching and Pose Estimation with Adaptive Pooling

Fei Xue, Ignas Budvytis, Roberto Cipolla

Previous methods solve feature matching and pose estimation using a two-stage process by first finding matches and then estimating the pose. As they ignore the geometric relationships between the two tasks, they focus on either improving the quality of matches or filtering potential outliers, leading to limited efficiency or accuracy. In contrast, we propose an iterative matching and pose estimation framework (IMP) leveraging the geometric connections between the two tasks: a few good matches are enough for a roughly accurate pose estimation; a roughly accurate pose can be used to guide the matching by providing geometric constraints. To this end, we implement a geometry-aware recurrent attention-based module which jointly outputs sparse matches and camera poses. Specifically, for each iteration, we first implicitly embed geometric information into the module via a pose-consistency loss, allowing it to predict geometry-aware matches progressively. Second, we introduce an efficient IMP, called EIMP, to dynamically discard keypoints without potential matches, avoiding redundant updating and significantly reducing the quadratic time complexity of attention computation in transformers. Experiments on YFCC100M, ScanNet, and Aachen Day-Night datasets demonstrate that the proposed method outperforms previous approaches in terms of accuracy and efficiency.
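The cost argument behind discarding keypoints can be illustrated with a toy calculation. The sketch below is not the EIMP implementation: the per-keypoint scores, the threshold, and the function names are all hypothetical, and a fixed threshold stands in for the learned, adaptive pooling described in the abstract. It only shows that the number of cross-attention entries scales with the product of the two keypoint counts, so pruning both sets shrinks it quickly.

```python
def attention_pairs(n_a, n_b):
    """Number of cross-attention score entries between two keypoint sets."""
    return n_a * n_b


def prune_keypoints(scores, threshold):
    """Keep indices whose (hypothetical) matchability score passes the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


# Hypothetical matchability scores for the keypoints of two images.
scores_a = [0.9, 0.1, 0.8, 0.05, 0.7, 0.2]
scores_b = [0.85, 0.15, 0.6, 0.3]

kept_a = prune_keypoints(scores_a, 0.5)
kept_b = prune_keypoints(scores_b, 0.5)

full_cost = attention_pairs(len(scores_a), len(scores_b))
pruned_cost = attention_pairs(len(kept_a), len(kept_b))
print(full_cost, pruned_cost)
```

In this toy case, halving each keypoint set cuts the number of attention entries to a quarter.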

CV · May 8, 2021
Active Terahertz Imaging Dataset for Concealed Object Detection

Dong Liang, Fei Xue, Ling Li

Concealed object detection in Terahertz imaging is an urgent need for public security and counter-terrorism. In this paper, we provide a public dataset for evaluating multi-object detection algorithms in active Terahertz imaging at a resolution of 5 mm by 5 mm. To the best of our knowledge, this is the first public Terahertz imaging dataset prepared to evaluate object detection algorithms. Object detection on this dataset is much more difficult than on standard public object detection datasets due to its inferior imaging quality. To address the problems of imbalanced samples and hard training samples in object detection, we evaluate four popular detectors on this dataset: YOLOv3, YOLOv4, FRCN-OHEM, and RetinaNet. Experimental results indicate that RetinaNet achieves the highest mAP. In addition, we demonstrate that hiding objects in different parts of the human body affects detection accuracy. The dataset is available at https://github.com/LingLIx/THz_Dataset.

CV · Jan 24, 2025
MATCHA: Towards Matching Anything

Fei Xue, Sven Elflein, Laura Leal-Taixé et al.

Establishing correspondences across images is a fundamental challenge in computer vision, underpinning tasks like Structure-from-Motion, image editing, and point tracking. Traditional methods are often specialized for specific correspondence types (geometric, semantic, or temporal), whereas humans naturally identify alignments across these domains. Inspired by this flexibility, we propose MATCHA, a unified feature model designed to "rule them all", establishing robust correspondences across diverse matching tasks. Building on insights that diffusion model features can encode multiple correspondence types, MATCHA augments this capacity by dynamically fusing high-level semantic and low-level geometric features through an attention-based module, creating expressive, versatile, and robust features. Additionally, MATCHA integrates object-level features from DINOv2 to further boost generalization, enabling a single feature capable of matching anything. Extensive experiments validate that MATCHA consistently surpasses state-of-the-art methods across geometric, semantic, and temporal matching tasks, setting a new foundation for a unified approach to the fundamental correspondence problem in computer vision. To the best of our knowledge, MATCHA is the first approach that is able to effectively tackle diverse matching tasks with a single unified feature.

CV · Apr 14, 2024
VRS-NeRF: Visual Relocalization with Sparse Neural Radiance Field

Fei Xue, Ignas Budvytis, Daniel Olmeda Reino et al.

Visual relocalization is a key technique for autonomous driving, robotics, and virtual/augmented reality. After decades of exploration, absolute pose regression (APR), scene coordinate regression (SCR), and hierarchical methods (HMs) have become the most popular frameworks. However, in spite of their high efficiency, APRs and SCRs have limited accuracy, especially in large-scale outdoor scenes; HMs are accurate but need to store a large number of 2D descriptors for matching, resulting in poor efficiency. In this paper, we propose an efficient and accurate framework, called VRS-NeRF, for visual relocalization with sparse neural radiance fields. Specifically, we introduce an explicit geometric map (EGM) for 3D map representation and an implicit learning map (ILM) for sparse patch rendering. In the localization process, EGM provides priors of sparse 2D points and ILM utilizes these sparse points to render patches with sparse NeRFs for matching. This allows us to discard a large number of 2D descriptors so as to reduce the map size. Moreover, rendering patches only for useful points rather than all pixels in the whole image reduces the rendering time significantly. This framework inherits the accuracy of HMs and discards their low efficiency. Experiments on 7Scenes, CambridgeLandmarks, and Aachen datasets show that our method gives much better accuracy than APRs and SCRs, and close performance to HMs but is much more efficient.

CV · Apr 10
Online3R: Online Learning for Consistent Sequential Reconstruction Based on Geometry Foundation Model

Shunkai Zhou, Zike Yan, Fei Xue et al.

We present Online3R, a new sequential reconstruction framework that is capable of adapting to new scenes through online learning, effectively resolving inconsistency issues. Specifically, we introduce a set of learnable lightweight visual prompts into a pretrained, frozen geometry foundation model to capture the knowledge of new environments while preserving the fundamental capability of the foundation model for geometry prediction. To solve the problems of missing groundtruth and the requirement of high efficiency when updating these visual prompts at test time, we introduce a local-global self-supervised learning strategy by enforcing the local and global consistency constraints on predictions. The local consistency constraints are conducted on intermediate and previously local fused results, enabling the model to be trained with high-quality pseudo groundtruth signals; the global consistency constraints are operated on sparse keyframes spanning long distances rather than per frame, allowing the model to learn from a consistent prediction over a long trajectory in an efficient way. Our experiments demonstrate that Online3R outperforms previous state-of-the-art methods on various benchmarks. Project page: https://shunkaizhou.github.io/online3r-1.0/

CV · Apr 11, 2024
PRAM: Place Recognition Anywhere Model for Efficient Visual Localization

Fei Xue, Ignas Budvytis, Roberto Cipolla

Visual localization is a key technique for a variety of applications, e.g., autonomous driving, AR/VR, and robotics. For these real applications, both efficiency and accuracy are important, especially on edge devices with limited computing resources. However, previous frameworks, e.g., absolute pose regression (APR), scene coordinate regression (SCR), and the hierarchical method (HM), are limited in either accuracy or efficiency in both indoor and outdoor environments. In this paper, we propose the place recognition anywhere model (PRAM), a new framework, to perform visual localization efficiently and accurately by recognizing 3D landmarks. Specifically, PRAM first generates landmarks directly in 3D space in a self-supervised manner. Without relying on commonly used classic semantic labels, these 3D landmarks can be defined in any place in indoor and outdoor scenes with higher generalization ability. Representing the map with 3D landmarks, PRAM discards global descriptors, repetitive local descriptors, and redundant 3D points, increasing the memory efficiency significantly. Then, sparse keypoints, rather than dense pixels, are utilized as the input tokens to a transformer-based recognition module for landmark recognition, which enables PRAM to recognize hundreds of landmarks with high time and memory efficiency. At test time, sparse keypoints and predicted landmark labels are utilized for outlier removal and landmark-wise 2D-3D matching as opposed to exhaustive 2D-2D matching, which further increases the time efficiency. A comprehensive evaluation of APRs, SCRs, HMs, and PRAM on both indoor and outdoor datasets demonstrates that PRAM outperforms APRs and SCRs in large-scale scenes by a large margin and gives competitive accuracy to HMs but reduces memory cost by over 90% and runs 2.4 times faster, leading to a better balance between efficiency and accuracy.
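The idea of landmark-wise matching can be sketched in a few lines. This is an illustrative toy, not PRAM's code: the function name, the label lists, and the convention that a keypoint labeled `None` is an outlier are assumptions made for the example, and the baseline it compares against is a simplified exhaustive keypoint-to-point pairing. Each 2D keypoint is paired only with the 3D points that share its predicted landmark label, which shrinks the candidate set.

```python
from collections import defaultdict


def landmark_wise_candidates(keypoint_labels, point_labels):
    """Pair each 2D keypoint only with 3D points sharing its landmark label."""
    points_by_label = defaultdict(list)
    for point_id, label in enumerate(point_labels):
        points_by_label[label].append(point_id)

    candidates = {}
    for kp_id, label in enumerate(keypoint_labels):
        if label is None:
            # Assumption: keypoints without a landmark label are outliers.
            continue
        candidates[kp_id] = list(points_by_label.get(label, []))
    return candidates


# Hypothetical predicted labels for 4 keypoints and stored labels for 5 map points.
keypoint_labels = [0, 1, None, 1]
point_labels = [0, 0, 1, 2, 1]

candidates = landmark_wise_candidates(keypoint_labels, point_labels)
n_landmark_wise = sum(len(v) for v in candidates.values())
n_exhaustive = len(keypoint_labels) * len(point_labels)
print(candidates, n_landmark_wise, n_exhaustive)
```

Here the label filter both removes the unlabeled keypoint and cuts the number of candidate pairs that a descriptor comparison would have to score.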

ME · Jun 7, 2021
Statistical Inference for High-Dimensional Linear Regression with Blockwise Missing Data

Fei Xue, Rong Ma, Hongzhe Li

Blockwise missing data occurs frequently when we integrate multisource or multimodality data where different sources or modalities contain complementary information. In this paper, we consider a high-dimensional linear regression model with blockwise missing covariates and a partially observed response variable. Under this framework, we propose a computationally efficient estimator for the regression coefficient vector based on carefully constructed unbiased estimating equations and a blockwise imputation procedure, and obtain its rate of convergence. Furthermore, building upon an innovative projected estimating equation technique that intrinsically achieves bias-correction of the initial estimator, we propose a nearly unbiased estimator for each individual regression coefficient, which is asymptotically normally distributed under mild conditions. Based on these debiased estimators, asymptotically valid confidence intervals and statistical tests about each regression coefficient are constructed. Numerical studies and application analysis of the Alzheimer's Disease Neuroimaging Initiative data show that the proposed method performs better and benefits more from unsupervised samples than existing methods.
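For orientation, the generic way an asymptotically normal estimator yields confidence intervals and tests can be written as follows. The symbols are assumptions introduced here for illustration, not the paper's notation: \tilde{\beta}_j denotes a debiased estimate of the j-th coefficient, \widehat{\mathrm{se}}_j its estimated standard error, and z_{1-\alpha/2} the standard normal quantile.

```latex
% Generic Wald-type interval and test, assuming
% (\tilde{\beta}_j - \beta_j) / \widehat{\mathrm{se}}_j \to N(0, 1) in distribution.
\[
  \mathrm{CI}_{1-\alpha}(\beta_j)
  = \left[\, \tilde{\beta}_j - z_{1-\alpha/2}\,\widehat{\mathrm{se}}_j,\;
            \tilde{\beta}_j + z_{1-\alpha/2}\,\widehat{\mathrm{se}}_j \,\right],
\]
\[
  \text{reject } H_0\colon \beta_j = 0
  \quad\text{if}\quad
  \left| \tilde{\beta}_j / \widehat{\mathrm{se}}_j \right| > z_{1-\alpha/2}.
\]
```

How the standard error is constructed under blockwise missingness is specific to the paper and is not reproduced here.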

CV · Sep 21, 2020
Line Flow based SLAM

Qiuyuan Wang, Zike Yan, Junqiu Wang et al.

We propose a visual SLAM method by predicting and updating line flows that represent sequential 2D projections of 3D line segments. While feature-based SLAM methods have achieved excellent results, they still face problems in challenging scenes containing occlusions, blurred images, and repetitive textures. To address these problems, we leverage a line flow to encode the coherence of line segment observations of the same 3D line along the temporal dimension, which has been neglected in prior SLAM systems. Thanks to this line flow representation, line segments in a new frame can be predicted according to their corresponding 3D lines and their predecessors along the temporal dimension. We create, update, merge, and discard line flows on-the-fly. We model the proposed line flow based SLAM (LF-SLAM) using a Bayesian network. Extensive experimental results demonstrate that the proposed LF-SLAM method achieves state-of-the-art results due to the utilization of line flows. Specifically, LF-SLAM obtains good localization and mapping results in challenging scenes with occlusions, blurred images, and repetitive textures.

RO · Aug 2, 2020
Deep Visual Odometry with Adaptive Memory

Fei Xue, Xin Wang, Junqiu Wang et al.

We propose a novel deep visual odometry (VO) method that considers global information by selecting memory and refining poses. Existing learning-based methods take the VO task as a pure tracking problem via recovering camera poses from image snippets, leading to severe error accumulation. Global information is crucial for alleviating accumulated errors. However, it is challenging to effectively preserve such information for end-to-end systems. To deal with this challenge, we design an adaptive memory module, which progressively and adaptively saves the information from local to global in a neural analogue of memory, enabling our system to process long-term dependency. Benefiting from global information in the memory, previous results are further refined by an additional refining module. With the guidance of previous outputs, we adopt a spatial-temporal attention to select features for each view based on the co-visibility in feature domain. Specifically, our architecture consisting of Tracking, Remembering and Refining modules works beyond tracking. Experiments on the KITTI and TUM-RGBD datasets demonstrate that our approach outperforms state-of-the-art methods by large margins and produces competitive results against classic approaches in regular scenes. Moreover, our model achieves outstanding performance in challenging scenarios such as texture-less regions and abrupt motions, where classic algorithms tend to fail.

CV · May 13, 2020
Self-Supervised Deep Visual Odometry with Online Adaptation

Shunkai Li, Xin Wang, Yingdian Cao et al.

Self-supervised VO methods have shown great success in jointly estimating camera pose and depth from videos. However, like most data-driven methods, existing VO networks suffer from a notable decrease in performance when confronted with scenes different from the training data, which makes them unsuitable for practical applications. In this paper, we propose an online meta-learning algorithm to enable VO networks to continuously adapt to new environments in a self-supervised manner. The proposed method utilizes convolutional long short-term memory (convLSTM) to aggregate rich spatial-temporal information from the past. The network is able to memorize and learn from its past experience for better estimation and fast adaptation to the current frame. When running VO in the open world, in order to deal with the changing environment, we propose an online feature alignment method that aligns feature distributions at different times. Our VO network is able to seamlessly adapt to different environments. Extensive experiments on unseen outdoor scenes, virtual-to-real-world, and outdoor-to-indoor environments demonstrate that our method consistently and considerably outperforms state-of-the-art self-supervised VO baselines.

ME · Jan 14, 2020
Multicategory Angle-based Learning for Estimating Optimal Dynamic Treatment Regimes with Censored Data

Fei Xue, Yanqing Zhang, Wenzhuo Zhou et al.

An optimal dynamic treatment regime (DTR) consists of a sequence of decision rules that maximize long-term benefits, and is applicable to chronic diseases such as HIV infection or cancer. In this paper, we develop a novel angle-based approach to search for the optimal DTR under a multicategory treatment framework for survival data. The proposed method targets maximization of the conditional survival function of patients following a DTR. In contrast to most existing approaches, which are designed to maximize the expected survival time under a binary treatment framework, the proposed method solves the multicategory treatment problem given multiple stages for censored data. Specifically, the proposed method obtains the optimal DTR by integrating estimations of decision rules at multiple stages into a single multicategory classification algorithm without imposing additional constraints, which is also more computationally efficient and robust. In theory, we establish Fisher consistency of the proposed method under regularity conditions. Our numerical studies show that the proposed method outperforms competing methods in terms of maximizing the conditional survival function. We apply the proposed method to two real datasets: Framingham heart study data and acquired immunodeficiency syndrome (AIDS) clinical data.

CV · Aug 23, 2019
Sequential Adversarial Learning for Self-Supervised Deep Visual Odometry

Shunkai Li, Fei Xue, Xin Wang et al.

We propose a self-supervised learning framework for visual odometry (VO) that incorporates correlation of consecutive frames and takes advantage of adversarial learning. Previous methods tackle self-supervised VO as a local structure from motion (SfM) problem that recovers depth from a single image and relative poses from image pairs by minimizing photometric loss between warped and captured images. As single-view depth estimation is an ill-posed problem, and photometric loss is incapable of discriminating distortion artifacts of warped images, the estimated depth is vague and the pose is inaccurate. In contrast to previous methods, our framework learns a compact representation of frame-to-frame correlation, which is updated by incorporating sequential information. The updated representation is used for depth estimation. In addition, we tackle VO as a self-supervised image generation task and take advantage of Generative Adversarial Networks (GAN). The generator learns to estimate depth and pose to generate a warped target image. The discriminator evaluates the quality of the generated image with high-level structural perception, which overcomes the problem of pixel-wise loss in previous methods. Experiments on KITTI and Cityscapes datasets show that our method obtains more accurate depth with details preserved, and that its predicted poses significantly outperform those of state-of-the-art self-supervised methods.

CV · Aug 6, 2019
Local Supports Global: Deep Camera Relocalization with Sequence Enhancement

Fei Xue, Xin Wang, Zike Yan et al.

We propose to leverage the local information in image sequences to support global camera relocalization. In contrast to previous methods that regress global poses from single images, we exploit the spatial-temporal consistency in sequential images to alleviate uncertainty due to visual ambiguities by incorporating a visual odometry (VO) component. Specifically, we introduce two effective steps called content-augmented pose estimation and motion-based refinement. The content-augmentation step focuses on alleviating the uncertainty of pose estimation by augmenting the observation based on the co-visibility in local maps built by the VO stream. The motion-based refinement is formulated as a pose graph, where the camera poses are further optimized by adopting relative poses provided by the VO component as additional motion constraints. Thus, the global consistency can be guaranteed. Experiments on the public indoor 7-Scenes and outdoor Oxford RobotCar benchmark datasets demonstrate that, benefiting from the local information inherent in the sequence, our approach outperforms state-of-the-art methods, especially in some challenging cases, e.g., insufficient texture, highly repetitive textures, similar appearances, and over-exposure.
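The role of relative poses as motion constraints in a pose graph can be illustrated with a deliberately simplified 1D example. This is a toy sketch under stated assumptions, not the paper's formulation: poses are scalars on a line, the cost is a plain weighted least-squares sum, the solver is Gauss-Seidel coordinate descent, and `refine_poses`, `weight`, and `sweeps` are hypothetical names. The cost being minimized is sum_i (x_i - g_i)^2 + weight * sum_i (x_{i+1} - x_i - d_i)^2, where g_i are per-frame global estimates and d_i are relative motions from odometry.

```python
def refine_poses(global_est, relative, weight=10.0, sweeps=200):
    """Refine 1D poses using global estimates plus relative-motion constraints."""
    x = list(global_est)
    n = len(x)
    for _ in range(sweeps):
        for i in range(n):
            # Closed-form minimizer of the cost in x[i] with the others fixed.
            num = global_est[i]
            den = 1.0
            if i > 0:
                num += weight * (x[i - 1] + relative[i - 1])
                den += weight
            if i < n - 1:
                num += weight * (x[i + 1] - relative[i])
                den += weight
            x[i] = num / den
    return x


# Hypothetical data: true poses are 0, 1, 2, 3, 4; the third global estimate is off.
global_est = [0.0, 1.0, 2.9, 3.0, 4.0]
relative = [1.0, 1.0, 1.0, 1.0]  # odometry says each step moves by 1

refined = refine_poses(global_est, relative)
print([round(v, 3) for v in refined])
```

In this toy setting the relative constraints pull the outlying estimate back toward its neighbors. The correction is not free: part of the error is redistributed to the other poses, which is the usual trade-off when the global and relative terms are weighted against each other.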

CV · Apr 3, 2019
Beyond Tracking: Selecting Memory and Refining Poses for Deep Visual Odometry

Fei Xue, Xin Wang, Shunkai Li et al.

Most previous learning-based visual odometry (VO) methods take VO as a pure tracking problem. In contrast, we present a VO framework that incorporates two additional components called Memory and Refining. The Memory component preserves global information by employing an adaptive and efficient selection strategy. The Refining component improves previous results with the contexts stored in the Memory by adopting a spatial-temporal attention mechanism for feature distilling. Experiments on the KITTI and TUM-RGBD benchmark datasets demonstrate that our method outperforms state-of-the-art learning-based methods by a large margin and produces competitive results against classic monocular VO approaches. In particular, our model achieves outstanding performance in challenging scenarios such as texture-less regions and abrupt motions, where classic VO algorithms tend to fail.

CV · Nov 25, 2018
Guided Feature Selection for Deep Visual Odometry

Fei Xue, Qiuyuan Wang, Xin Wang et al.

We present a novel end-to-end visual odometry architecture with guided feature selection based on deep convolutional recurrent neural networks. Different from current monocular visual odometry methods, our approach is established on the intuition that features contribute discriminately to different motion patterns. Specifically, we propose a dual-branch recurrent network to learn the rotation and translation separately by leveraging current Convolutional Neural Network (CNN) for feature representation and Recurrent Neural Network (RNN) for image sequence reasoning. To enhance the ability of feature selection, we further introduce an effective context-aware guidance mechanism to force each branch to distill related information for specific motion pattern explicitly. Experiments demonstrate that on the prevalent KITTI and ICL_NUIM benchmarks, our method outperforms current state-of-the-art model- and learning-based methods for both decoupled and joint camera pose recovery.