Xiang Wu

CV
h-index16
34papers
2,318citations
Novelty57%
AI Score51

34 Papers

AIAug 4, 2022
Developmental Network Two, Its Optimality, and Emergent Turing Machines

Juyang Weng, Zejia Zheng, Xiang Wu

Strong AI requires the learning engine to be task non-specific and to automatically construct a dynamic hierarchy of internal features. By hierarchy, we mean, e.g., short road edges and short bush edges amount to intermediate features of landmarks; but intermediate features from tree shadows are distractors that must be disregarded by the high-level landmark concept. By dynamic, we mean the automatic selection of features while disregarding distractors is not static, but instead based on dynamic statistics (e.g. because of the instability of shadows in the context of landmark). By internal features, we mean that they are not only sensory, but also motor, so that context from motor (state) integrates with sensory inputs to become a context-based logic machine. We present why strong AI is necessary for any practical AI systems that work reliably in the real world. We then present a new generation of Developmental Networks 2 (DN-2). With many new novelties beyond DN-1, the most important novelty of DN-2 is that the inhibition area of each internal neuron is neuron-specific and dynamic. This enables DN-2 to automatically construct an internal hierarchy that is fluid, whose number of areas is not static as in DN-1. To optimally use the limited resource available, we establish that DN-2 is optimal in terms of maximum likelihood, under the condition of limited learning experience and limited resources. We also present how DN-2 can learn an emergent Universal Turing Machine (UTM). Together with the optimality, we present the optimal UTM. Experiments for real-world vision-based navigation, maze planning, and audition used DN-2. They successfully showed that DN-2 is for general purposes using natural and synthetic inputs. Their automatically constructed internal representation focuses on important features while being invariant to distractors and other irrelevant context-concepts.

LGJan 30
Scalable Topology-Preserving Graph Coarsening with Graph Collapse

Xiang Wu, Rong-Hua Li, Xunkai Li et al.

Graph coarsening reduces the size of a graph while preserving certain properties. Most existing methods preserve either spectral or spatial characteristics. Recent research has shown that preserving topological features helps maintain the predictive performance of graph neural networks (GNNs) trained on the coarsened graph but suffers from exponential time complexity. To address these problems, we propose Scalable Topology-Preserving Graph Coarsening (STPGC) by introducing the concepts of graph strong collapse and graph edge collapse extended from algebraic topology. STPGC comprises three new algorithms, GStrongCollapse, GEdgeCollapse, and NeighborhoodConing based on these two concepts, which eliminate dominated nodes and edges while rigorously preserving topological features. We further prove that STPGC preserves the GNN receptive field and develop approximate algorithms to accelerate GNN training. Experiments on node classification with GNNs demonstrate the efficiency and effectiveness of STPGC.

CVSep 22, 2025Code
Multimodal Medical Image Classification via Synergistic Learning Pre-training

Qinghua Lin, Guang-Hai Liu, Zuoyong Li et al.

Multimodal pathological images are usually in clinical diagnosis, but computer vision-based multimodal image-assisted diagnosis faces challenges with modality fusion, especially in the absence of expert-annotated data. To achieve the modality fusion in multimodal images with label scarcity, we propose a novel ``pretraining + fine-tuning" framework for multimodal semi-supervised medical image classification. Specifically, we propose a synergistic learning pretraining framework of consistency, reconstructive, and aligned learning. By treating one modality as an augmented sample of another modality, we implement a self-supervised learning pre-train, enhancing the baseline model's feature representation capability. Then, we design a fine-tuning method for multimodal fusion. During the fine-tuning stage, we set different encoders to extract features from the original modalities and provide a multimodal fusion encoder for fusion modality. In addition, we propose a distribution shift method for multimodal fusion features, which alleviates the prediction uncertainty and overfitting risks caused by the lack of labeled samples. We conduct extensive experiments on the publicly available gastroscopy image datasets Kvasir and Kvasirv2. Quantitative and qualitative results demonstrate that the proposed method outperforms the current state-of-the-art classification methods. The code will be released at: https://github.com/LQH89757/MICS.

CVJan 21, 2021Code
CM-NAS: Cross-Modality Neural Architecture Search for Visible-Infrared Person Re-Identification

Chaoyou Fu, Yibo Hu, Xiang Wu et al.

Visible-Infrared person re-identification (VI-ReID) aims to match cross-modality pedestrian images, breaking through the limitation of single-modality person ReID in dark environment. In order to mitigate the impact of large modality discrepancy, existing works manually design various two-stream architectures to separately learn modality-specific and modality-sharable representations. Such a manual design routine, however, highly depends on massive experiments and empirical practice, which is time consuming and labor intensive. In this paper, we systematically study the manually designed architectures, and identify that appropriately separating Batch Normalization (BN) layers is the key to bring a great boost towards cross-modality matching. Based on this observation, the essential objective is to find the optimal separation scheme for each BN layer. To this end, we propose a novel method, named Cross-Modality Neural Architecture Search (CM-NAS). It consists of a BN-oriented search space in which the standard optimization can be fulfilled subject to the cross-modality task. Equipped with the searched architecture, our method outperforms state-of-the-art counterparts in both two benchmarks, improving the Rank-1/mAP by 6.70%/6.13% on SYSU-MM01 and by 12.17%/11.23% on RegDB. Code is released at https://github.com/JDAI-CV/CM-NAS.

CVAug 12, 2020Code
TF-NAS: Rethinking Three Search Freedoms of Latency-Constrained Differentiable Neural Architecture Search

Yibo Hu, Xiang Wu, Ran He

With the flourish of differentiable neural architecture search (NAS), automatically searching latency-constrained architectures gives a new perspective to reduce human labor and expertise. However, the searched architectures are usually suboptimal in accuracy and may have large jitters around the target latency. In this paper, we rethink three freedoms of differentiable NAS, i.e. operation-level, depth-level and width-level, and propose a novel method, named Three-Freedom NAS (TF-NAS), to achieve both good classification accuracy and precise latency constraint. For the operation-level, we present a bi-sampling search algorithm to moderate the operation collapse. For the depth-level, we introduce a sink-connecting search space to ensure the mutual exclusion between skip and other candidate operations, as well as eliminate the architecture redundancy. For the width-level, we propose an elasticity-scaling strategy that achieves precise latency constraint in a progressively fine-grained manner. Experiments on ImageNet demonstrate the effectiveness of TF-NAS. Particularly, our searched TF-NAS-A obtains 76.9% top-1 accuracy, achieving state-of-the-art results with less latency. The total search time is only 1.8 days on 1 Titan RTX GPU. Code is available at https://github.com/AberHu/TF-NAS.

CVMar 25, 2019Code
Dual Variational Generation for Low-Shot Heterogeneous Face Recognition

Chaoyou Fu, Xiang Wu, Yibo Hu et al.

Heterogeneous Face Recognition (HFR) is a challenging issue because of the large domain discrepancy and a lack of heterogeneous data. This paper considers HFR as a dual generation problem, and proposes a novel Dual Variational Generation (DVG) framework. It generates large-scale new paired heterogeneous images with the same identity from noise, for the sake of reducing the domain gap of HFR. Specifically, we first introduce a dual variational autoencoder to represent a joint distribution of paired heterogeneous images. Then, in order to ensure the identity consistency of the generated paired heterogeneous images, we impose a distribution alignment in the latent space and a pairwise identity preserving in the image space. Moreover, the HFR network reduces the domain discrepancy by constraining the pairwise feature distances between the generated paired heterogeneous images. Extensive experiments on four HFR databases show that our method can significantly improve state-of-the-art results. The related code is available at https://github.com/BradyFU/DVG.

CVNov 9, 2015Code
A Light CNN for Deep Face Representation with Noisy Labels

Xiang Wu, Ran He, Zhenan Sun et al.

The volume of convolutional neural network (CNN) models proposed for face recognition has been continuously growing larger to better fit large amount of training data. When training data are obtained from internet, the labels are likely to be ambiguous and inaccurate. This paper presents a Light CNN framework to learn a compact embedding on the large-scale face data with massive noisy labels. First, we introduce a variation of maxout activation, called Max-Feature-Map (MFM), into each convolutional layer of CNN. Different from maxout activation that uses many feature maps to linearly approximate an arbitrary convex activation function, MFM does so via a competitive relationship. MFM can not only separate noisy and informative signals but also play the role of feature selection between two feature maps. Second, three networks are carefully designed to obtain better performance meanwhile reducing the number of parameters and computational costs. Lastly, a semantic bootstrapping method is proposed to make the prediction of the networks more consistent with noisy labels. Experimental results show that the proposed framework can utilize large-scale noisy data to learn a Light model that is efficient in computational costs and storage spaces. The learned single network with a 256-D representation achieves state-of-the-art results on various face benchmarks without fine-tuning. The code is released on https://github.com/AlfredXiangWu/LightCNN.

CVDec 29, 2025
OptFormer: Optical Flow-Guided Attention and Phase Space Reconstruction for SST Forecasting

Yin Wang, Chunlin Gong, Zhuozhen Xu et al.

Sea Surface Temperature (SST) prediction plays a vital role in climate modeling and disaster forecasting. However, it remains challenging due to its nonlinear spatiotemporal dynamics and extended prediction horizons. To address this, we propose OptFormer, a novel encoder-decoder model that integrates phase-space reconstruction with a motion-aware attention mechanism guided by optical flow. Unlike conventional attention, our approach leverages inter-frame motion cues to highlight relative changes in the spatial field, allowing the model to focus on dynamic regions and capture long-range temporal dependencies more effectively. Experiments on NOAA SST datasets across multiple spatial scales demonstrate that OptFormer achieves superior performance under a 1:1 training-to-prediction setting, significantly outperforming existing baselines in accuracy and robustness.

CVJan 11, 2025
SP-SLAM: Neural Real-Time Dense SLAM With Scene Priors

Zhen Hong, Bowen Wang, Haoran Duan et al.

Neural implicit representations have recently shown promising progress in dense Simultaneous Localization And Mapping (SLAM). However, existing works have shortcomings in terms of reconstruction quality and real-time performance, mainly due to inflexible scene representation strategy without leveraging any prior information. In this paper, we introduce SP-SLAM, a novel neural RGB-D SLAM system that performs tracking and mapping in real-time. SP-SLAM computes depth images and establishes sparse voxel-encoded scene priors near the surfaces to achieve rapid convergence of the model. Subsequently, the encoding voxels computed from single-frame depth image are fused into a global volume, which facilitates high-fidelity surface reconstruction. Simultaneously, we employ tri-planes to store scene appearance information, striking a balance between achieving high-quality geometric texture mapping and minimizing memory consumption. Furthermore, in SP-SLAM, we introduce an effective optimization strategy for mapping, allowing the system to continuously optimize the poses of all historical input frames during runtime without increasing computational overhead. We conduct extensive evaluations on five benchmark datasets (Replica, ScanNet, TUM RGB-D, Synthetic RGB-D, 7-Scenes). The results demonstrate that, compared to existing methods, we achieve superior tracking accuracy and reconstruction quality, while running at a significantly faster speed.

LGApr 23, 2025
STFM: A Spatio-Temporal Information Fusion Model Based on Phase Space Reconstruction for Sea Surface Temperature Prediction

Yin Wang, Chunlin Gong, Xiang Wu et al.

The sea surface temperature (SST), a key environmental parameter, is crucial to optimizing production planning, making its accurate prediction a vital research topic. However, the inherent nonlinearity of the marine dynamic system presents significant challenges. Current forecasting methods mainly include physics-based numerical simulations and data-driven machine learning approaches. The former, while describing SST evolution through differential equations, suffers from high computational complexity and limited applicability, whereas the latter, despite its computational benefits, requires large datasets and faces interpretability challenges. This study presents a prediction framework based solely on data-driven techniques. Using phase space reconstruction, we construct initial-delay attractor pairs with a mathematical homeomorphism and design a Spatio-Temporal Fusion Mapping (STFM) to uncover their intrinsic connections. Unlike conventional models, our method captures SST dynamics efficiently through phase space reconstruction and achieves high prediction accuracy with minimal training data in comparative tests

LGJan 27, 2025
ScaDyG:A New Paradigm for Large-scale Dynamic Graph Learning

Xiang Wu, Xunkai Li, Rong-Hua Li et al.

Dynamic graphs (DGs), which capture time-evolving relationships between graph entities, have widespread real-world applications. To efficiently encode DGs for downstream tasks, most dynamic graph neural networks follow the traditional message-passing mechanism and extend it with time-based techniques. Despite their effectiveness, the growth of historical interactions introduces significant scalability issues, particularly in industry scenarios. To address this limitation, we propose ScaDyG, with the core idea of designing a time-aware scalable learning paradigm as follows: 1) Time-aware Topology Reformulation: ScaDyG first segments historical interactions into time steps (intra and inter) based on dynamic modeling, enabling weight-free and time-aware graph propagation within pre-processing. 2) Dynamic Temporal Encoding: To further achieve fine-grained graph propagation within time steps, ScaDyG integrates temporal encoding through a combination of exponential functions in a scalable manner. 3) Hypernetwork-driven Message Aggregation: After obtaining the propagated features (i.e., messages), ScaDyG utilizes hypernetwork to analyze historical dependencies, implementing node-wise representation by an adaptive temporal fusion. Extensive experiments on 12 datasets demonstrate that ScaDyG performs comparably well or even outperforms other SOTA methods in both node and link-level downstream tasks, with fewer learnable parameters and higher efficiency.

LGApr 27, 2021
One Backward from Ten Forward, Subsampling for Large-Scale Deep Learning

Chaosheng Dong, Xiaojie Jin, Weihao Gao et al.

Deep learning models in large-scale machine learning systems are often continuously trained with enormous data from production environments. The sheer volume of streaming training data poses a significant challenge to real-time training subsystems and ad-hoc sampling is the standard practice. Our key insight is that these deployed ML systems continuously perform forward passes on data instances during inference, but ad-hoc sampling does not take advantage of this substantial computational effort. Therefore, we propose to record a constant amount of information per instance from these forward passes. The extra information measurably improves the selection of which data instances should participate in forward and backward passes. A novel optimization framework is proposed to analyze this problem and we provide an efficient approximation algorithm under the framework of Mini-batch gradient descent as a practical solution. We also demonstrate the effectiveness of our framework and algorithm on several large-scale classification and regression tasks, when compared with competitive baselines widely used in industry.

CVNov 23, 2020
Imbalance Robust Softmax for Deep Embeeding Learning

Hao Zhu, Yang Yuan, Guosheng Hu et al.

Deep embedding learning is expected to learn a metric space in which features have smaller maximal intra-class distance than minimal inter-class distance. In recent years, one research focus is to solve the open-set problem by discriminative deep embedding learning in the field of face recognition (FR) and person re-identification (re-ID). Apart from open-set problem, we find that imbalanced training data is another main factor causing the performance degradation of FR and re-ID, and data imbalance widely exists in the real applications. However, very little research explores why and how data imbalance influences the performance of FR and re-ID with softmax or its variants. In this work, we deeply investigate data imbalance in the perspective of neural network optimisation and feature distribution about softmax. We find one main reason of performance degradation caused by data imbalance is that the weights (from the penultimate fully-connected layer) are far from their class centers in feature space. Based on this investigation, we propose a unified framework, Imbalance-Robust Softmax (IR-Softmax), which can simultaneously solve the open-set problem and reduce the influence of data imbalance. IR-Softmax can generalise to any softmax and its variants (which are discriminative for open-set problem) by directly setting the weights as their class centers, naturally solving the data imbalance problem. In this work, we explicitly re-formulate two discriminative softmax (A-Softmax and AM-Softmax) under the framework of IR-Softmax. We conduct extensive experiments on FR databases (LFW, MegaFace) and re-ID database (Market-1501, Duke), and IR-Softmax outperforms many state-of-the-art methods.

CVSep 20, 2020
DVG-Face: Dual Variational Generation for Heterogeneous Face Recognition

Chaoyou Fu, Xiang Wu, Yibo Hu et al.

Heterogeneous Face Recognition (HFR) refers to matching cross-domain faces and plays a crucial role in public security. Nevertheless, HFR is confronted with challenges from large domain discrepancy and insufficient heterogeneous data. In this paper, we formulate HFR as a dual generation problem, and tackle it via a novel Dual Variational Generation (DVG-Face) framework. Specifically, a dual variational generator is elaborately designed to learn the joint distribution of paired heterogeneous images. However, the small-scale paired heterogeneous training data may limit the identity diversity of sampling. In order to break through the limitation, we propose to integrate abundant identity information of large-scale visible data into the joint distribution. Furthermore, a pairwise identity preserving loss is imposed on the generated paired heterogeneous images to ensure their identity consistency. As a consequence, massive new diverse paired heterogeneous images with the same identity can be generated from noises. The identity consistency and identity diversity properties allow us to employ these generated images to train the HFR network via a contrastive learning mechanism, yielding both domain-invariant and discriminative embedding features. Concretely, the generated paired heterogeneous images are regarded as positive pairs, and the images obtained from different samplings are considered as negative pairs. Our method achieves superior performances over state-of-the-art methods on seven challenging databases belonging to five HFR tasks, including NIR-VIS, Sketch-Photo, Profile-Frontal Photo, Thermal-VIS, and ID-Camera. The related code will be released at https://github.com/BradyFU.

CVSep 17, 2020
Deep Momentum Uncertainty Hashing

Chaoyou Fu, Guoli Wang, Xiang Wu et al.

Combinatorial optimization (CO) has been a hot research topic because of its theoretic and practical importance. As a classic CO problem, deep hashing aims to find an optimal code for each data from finite discrete possibilities, while the discrete nature brings a big challenge to the optimization process. Previous methods usually mitigate this challenge by binary approximation, substituting binary codes for real-values via activation functions or regularizations. However, such approximation leads to uncertainty between real-values and binary ones, degrading retrieval performance. In this paper, we propose a novel Deep Momentum Uncertainty Hashing (DMUH). It explicitly estimates the uncertainty during training and leverages the uncertainty information to guide the approximation process. Specifically, we model bit-level uncertainty via measuring the discrepancy between the output of a hashing network and that of a momentum-updated network. The discrepancy of each bit indicates the uncertainty of the hashing network to the approximate output of that bit. Meanwhile, the mean discrepancy of all bits in a hashing code can be regarded as image-level uncertainty. It embodies the uncertainty of the hashing network to the corresponding input image. The hashing bit and image with higher uncertainty are paid more attention during optimization. To the best of our knowledge, this is the first work to study the uncertainty in hashing bits. Extensive experiments are conducted on four datasets to verify the superiority of our method, including CIFAR-10, NUS-WIDE, MS-COCO, and a million-scale dataset Clothing1M. Our method achieves the best performance on all of the datasets and surpasses existing state-of-the-art methods by a large margin.

CVDec 19, 2019
HAMBox: Delving into Online High-quality Anchors Mining for Detecting Outer Faces

Yang Liu, Xu Tang, Xiang Wu et al.

Current face detectors utilize anchors to frame a multi-task learning problem which combines classification and bounding box regression. Effective anchor design and anchor matching strategy enable face detectors to localize faces under large pose and scale variations. However, we observe that more than 80% correctly predicted bounding boxes are regressed from the unmatched anchors (the IoUs between anchors and target faces are lower than a threshold) in the inference phase. It indicates that these unmatched anchors perform excellent regression ability, but the existing methods neglect to learn from them. In this paper, we propose an Online High-quality Anchor Mining Strategy (HAMBox), which explicitly helps outer faces compensate with high-quality anchors. Our proposed HAMBox method could be a general strategy for anchor-based single-stage face detection. Experiments on various datasets, including WIDER FACE, FDDB, AFW and PASCAL Face, demonstrate the superiority of the proposed method. Furthermore, our team win the championship on the Face Detection test track of WIDER Face and Pedestrian Challenge 2019. We will release the codes with PaddlePaddle.

CVDec 18, 2019
Semantic Regularization: Improve Few-shot Image Classification by Reducing Meta Shift

Da Chen, Yongliang Yang, Zunlei Feng et al.

Few-shot image classification requires the classifier to robustly cope with unseen classes even if there are only a few samples for each class. Recent advances benefit from the meta-learning process where episodic tasks are formed to train a model that can adapt to class change. However, these task sare independent to each other and existing works mainly rely on limited samples of individual support set in a single meta task. This strategy leads to severe meta shift issues across multiple tasks, meaning the learned prototypes or class descriptors are not stable as each task only involves their own support set. To avoid this problem, we propose a concise Semantic RegularizationNetwork to learn a common semantic space under the framework of meta-learning. In this space, all class descriptors can be regularized by the learned semantic basis, which can effectively solve the meta shift problem. The key is to train a class encoder and decoder structure that can encode the sample embedding features into the semantic domain with trained semantic basis, and generate a more stable and general class descriptor from the decoder. We evaluate our work by extensive comparisons with previous methods on three benchmark datasets (MiniImageNet, TieredImageNet, and CUB). The results show that the semantic regularization module improves performance by 4%-7% over the baseline method, and achieves competitive results over the current state-of-the-art models.

CVJun 2, 2019
Hierarchical Video Frame Sequence Representation with Deep Convolutional Graph Network

Feng Mao, Xiang Wu, Hui Xue et al.

High accuracy video label prediction (classification) models are attributed to large scale data. These data could be frame feature sequences extracted by a pre-trained convolutional-neural-network, which promote the efficiency for creating models. Unsupervised solutions such as feature average pooling, as a simple label-independent parameter-free based method, has limited ability to represent the video. While the supervised methods, like RNN, can greatly improve the recognition accuracy. However, the video length is usually long, and there are hierarchical relationships between frames across events in the video, the performance of RNN based models are decreased. In this paper, we proposes a novel video classification method based on a deep convolutional graph neural network(DCGN). The proposed method utilize the characteristics of the hierarchical structure of the video, and performed multi-level feature extraction on the video frame sequence through the graph network, obtained a video representation re ecting the event semantics hierarchically. We test our model on YouTube-8M Large-Scale Video Understanding dataset, and the result outperforms RNN based benchmarks.

MLApr 28, 2019
Low-Rank Principal Eigenmatrix Analysis

Krishna Balasubramanian, Elynn Y. Chen, Jianqing Fan et al.

Sparse PCA is a widely used technique for high-dimensional data analysis. In this paper, we propose a new method called low-rank principal eigenmatrix analysis. Different from sparse PCA, the dominant eigenvectors are allowed to be dense but are assumed to have a low-rank structure when matricized appropriately. Such a structure arises naturally in several practical cases: Indeed the top eigenvector of a circulant matrix, when matricized appropriately is a rank-1 matrix. We propose a matricized rank-truncated power method that could be efficiently implemented and establish its computational and statistical properties. Extensive experiments on several synthetic data sets demonstrate the competitive empirical performance of our method.

CVMar 30, 2019
M2FPA: A Multi-Yaw Multi-Pitch High-Quality Database and Benchmark for Facial Pose Analysis

Peipei Li, Xiang Wu, Yibo Hu et al.

Facial images in surveillance or mobile scenarios often have large view-point variations in terms of pitch and yaw angles. These jointly occurred angle variations make face recognition challenging. Current public face databases mainly consider the case of yaw variations. In this paper, a new large-scale Multi-yaw Multi-pitch high-quality database is proposed for Facial Pose Analysis (M2FPA), including face frontalization, face rotation, facial pose estimation and pose-invariant face recognition. It contains 397,544 images of 229 subjects with yaw, pitch, attribute, illumination and accessory. M2FPA is the most comprehensive multi-view face database for facial pose analysis. Further, we provide an effective benchmark for face frontalization and pose-invariant face recognition on M2FPA with several state-of-the-art methods, including DR-GAN, TP-GAN and CAPG-GAN. We believe that the new database and benchmark can significantly push forward the advance of facial pose analysis in real-world applications. Moreover, a simple yet effective parsing guided discriminator is introduced to capture the local consistency during GAN optimization. Extensive quantitative and qualitative results on M2FPA and Multi-PIE demonstrate the superiority of our face frontalization method. Baseline results for both face synthesis and face recognition from state-of-theart methods demonstrate the challenge offered by this new database.

CVMar 30, 2019
UVA: A Universal Variational Framework for Continuous Age Analysis

Peipei Li, Huaibo Huang, Yibo Hu et al.

Conventional methods for facial age analysis tend to utilize accurate age labels in a supervised way. However, existing age datasets lies in a limited range of ages, leading to a long-tailed distribution. To alleviate the problem, this paper proposes a Universal Variational Aging (UVA) framework to formulate facial age priors in a disentangling manner. Benefiting from the variational evidence lower bound, the facial images are encoded and disentangled into an age-irrelevant distribution and an age-related distribution in the latent space. A conditional introspective adversarial learning mechanism is introduced to boost the image quality. In this way, when manipulating the age-related distribution, UVA can achieve age translation with arbitrary ages. Further, by sampling noise from the age-irrelevant distribution, we can generate photorealistic facial images with a specific age. Moreover, given an input face image, the mean value of age-related distribution can be treated as an age estimator. These indicate that UVA can efficiently and accurately estimate the age-related distribution by a disentangling manner, even if the training dataset performs a long-tailed age distribution. UVA is the first attempt to achieve facial age analysis tasks, including age translation, age generation and age estimation, in a universal framework. The qualitative and quantitative experiments demonstrate the superiority of UVA on five popular datasets, including CACD2000, Morph, UTKFace, MegaAge-Asian and FG-NET.

CVMar 28, 2019
High Fidelity Face Manipulation with Extreme Poses and Expressions

Chaoyou Fu, Yibo Hu, Xiang Wu et al.

Face manipulation has shown remarkable advances with the flourish of Generative Adversarial Networks. However, due to the difficulties of controlling structures and textures, it is challenging to model poses and expressions simultaneously, especially for the extreme manipulation at high-resolution. In this paper, we propose a novel framework that simplifies face manipulation into two correlated stages: a boundary prediction stage and a disentangled face synthesis stage. The first stage models poses and expressions jointly via boundary images. Specifically, a conditional encoder-decoder network is employed to predict the boundary image of the target face in a semi-supervised way. Pose and expression estimators are introduced to improve the prediction performance. In the second stage, the predicted boundary image and the input face image are encoded into the structure and the texture latent space by two encoder networks, respectively. A proxy network and a feature threshold loss are further imposed to disentangle the latent space. Furthermore, due to the lack of high-resolution face manipulation databases to verify the effectiveness of our method, we collect a new high-quality Multi-View Face (MVF-HQ) database. It contains 120,283 images at 6000x4000 resolution from 479 identities with diverse poses, expressions, and illuminations. MVF-HQ is much larger in scale and much higher in resolution than publicly available high-resolution face manipulation databases. We will release MVF-HQ soon to push forward the advance of face manipulation. Qualitative and quantitative experiments on four databases show that our method dramatically improves the synthesis quality.

LGMar 25, 2019
Local Orthogonal Decomposition for Maximum Inner Product Search

Xiang Wu, Ruiqi Guo, Sanjiv Kumar et al.

Inverted file and asymmetric distance computation (IVFADC) have been successfully applied to approximate nearest neighbor search and subsequently maximum inner product search. In such a framework, vector quantization is used for coarse partitioning while product quantization is used for quantizing residuals. In the original IVFADC as well as all of its variants, after residuals are computed, the second production quantization step is completely independent of the first vector quantization step. In this work, we seek to exploit the connection between these two steps when we perform non-exhaustive search. More specifically, we decompose a residual vector locally into two orthogonal components and perform uniform quantization and multiscale quantization to each component respectively. The proposed method, called local orthogonal decomposition, combined with multiscale quantization consistently achieves higher recall than previous methods under the same bitrates. We conduct comprehensive experiments on large scale datasets as well as detailed ablation tests, demonstrating effectiveness of our method.

LGMar 20, 2019
Efficient Inner Product Approximation in Hybrid Spaces

Xiang Wu, Ruiqi Guo, David Simcha et al.

Many emerging use cases of data mining and machine learning operate on large datasets with data from heterogeneous sources, specifically with both sparse and dense components. For example, dense deep neural network embedding vectors are often used in conjunction with sparse textual features to provide high dimensional hybrid representation of documents. Efficient search in such hybrid spaces is very challenging as the techniques that perform well for sparse vectors have little overlap with those that work well for dense vectors. Popular techniques like Locality Sensitive Hashing (LSH) and its data-dependent variants also do not give good accuracy in high dimensional hybrid spaces. Even though hybrid scenarios are becoming more prevalent, currently there exist no efficient techniques in literature that are both fast and accurate. In this paper, we propose a technique that approximates the inner product computation in hybrid vectors, leading to substantial speedup in search while maintaining high accuracy. We also propose efficient data structures that exploit modern computer architectures, resulting in orders of magnitude faster search than the existing baselines. The performance of the proposed method is demonstrated on several datasets including a very large scale industrial dataset containing one billion vectors in a billion dimensional space, achieving over 10x speedup and higher accuracy against competitive baselines.

LGOct 11, 2018
A Blended Deep Learning Approach for Predicting User Intended Actions

Fei Tan, Zhi Wei, Jun He et al.

User intended actions are widely seen in many areas. Forecasting these actions and taking proactive measures to optimize business outcome is a crucial step towards sustaining the steady business growth. In this work, we focus on pre- dicting attrition, which is one of typical user intended actions. Conventional attrition predictive modeling strategies suffer a few inherent drawbacks. To overcome these limitations, we propose a novel end-to-end learning scheme to keep track of the evolution of attrition patterns for the predictive modeling. It integrates user activity logs, dynamic and static user profiles based on multi-path learning. It exploits historical user records by establishing a decaying multi-snapshot technique. And finally it employs the precedent user intentions via guiding them to the subsequent learning procedure. As a result, it addresses all disadvantages of conventional methods. We evaluate our methodology on two public data repositories and one private user usage dataset provided by Adobe Creative Cloud. The extensive experiments demonstrate that it can offer the appealing performance in comparison with several existing approaches as rated by different popular metrics. Furthermore, we introduce an advanced interpretation and visualization strategy to effectively characterize the periodicity of user activity logs. It can help to pinpoint important factors that are critical to user attrition and retention and thus suggests actionable improvement targets for business practice. Our work will provide useful insights into the prediction and elucidation of other user intended actions as well.

CVSep 7, 2018
Neurons Merging Layer: Towards Progressive Redundancy Reduction for Deep Supervised Hashing

Chaoyou Fu, Liangchen Song, Xiang Wu et al.

Deep supervised hashing has become an active topic in information retrieval. It generates hashing bits by the output neurons of a deep hashing network. During binary discretization, there often exists much redundancy between hashing bits that degenerates retrieval performance in terms of both storage and accuracy. This paper proposes a simple yet effective Neurons Merging Layer (NMLayer) for deep supervised hashing. A graph is constructed to represent the redundancy relationship between hashing bits that is used to guide the learning of a hashing network. Specifically, it is dynamically learned by a novel mechanism defined in our active and frozen phases. According to the learned relationship, the NMLayer merges the redundant neurons together to balance the importance of each output neuron. Moreover, multiple NMLayers are progressively trained for a deep hashing network to learn a more compact hashing code from a long redundant code. Extensive experiments on four datasets demonstrate that our proposed method outperforms state-of-the-art hashing methods.

CVSep 6, 2018
Disentangled Variational Representation for Heterogeneous Face Recognition

Xiang Wu, Huaibo Huang, Vishal M. Patel et al.

Visible (VIS) to near infrared (NIR) face matching is a challenging problem due to the significant domain discrepancy between the domains and a lack of sufficient data for training cross-modal matching algorithms. Existing approaches attempt to tackle this problem by either synthesizing visible faces from NIR faces, extracting domain-invariant features from these modalities, or projecting heterogeneous data onto a common latent space for cross-modal matching. In this paper, we take a different approach in which we make use of the Disentangled Variational Representation (DVR) for cross-modal matching. First, we model a face representation with an intrinsic identity information and its within-person variations. By exploring the disentangled latent variable space, a variational lower bound is employed to optimize the approximate posterior for NIR and VIS representations. Second, aiming at obtaining more compact and discriminative disentangled latent space, we impose a minimization of the identity information for the same subject and a relaxed correlation alignment constraint between the NIR and VIS modality variations. An alternative optimization scheme is proposed for the disentangled variational representation part and the heterogeneous face recognition network part. The mutual promotion between these two parts effectively reduces the NIR and VIS domain discrepancy and alleviates over-fitting. Extensive experiments on three challenging NIR-VIS heterogeneous face recognition databases demonstrate that the proposed method achieves significant improvements over the state-of-the-art methods.

CVSep 12, 2017
Adversarial Discriminative Heterogeneous Face Recognition

Lingxiao Song, Man Zhang, Xiang Wu et al.

The gap between sensing patterns of different face modalities remains a challenging problem in heterogeneous face recognition (HFR). This paper proposes an adversarial discriminative feature learning framework to close the sensing gap via adversarial learning on both raw-pixel space and compact feature space. This framework integrates cross-spectral face hallucination and discriminative feature learning into an end-to-end adversarial network. In the pixel space, we make use of generative adversarial networks to perform cross-spectral face hallucination. An elaborate two-path model is introduced to alleviate the lack of paired images, which gives consideration to both global structures and local textures. In the feature space, an adversarial loss and a high-order variance discrepancy loss are employed to measure the global and local discrepancy between two heterogeneous distributions respectively. These two losses enhance domain-invariant feature learning and modality independent noise removing. Experimental results on three NIR-VIS databases show that our proposed approach outperforms state-of-the-art HFR methods, without requiring of complex network or large-scale training dataset.

CVSep 12, 2017
Anti-Makeup: Learning A Bi-Level Adversarial Network for Makeup-Invariant Face Verification

Yi Li, Lingxiao Song, Xiang Wu et al.

Makeup is widely used to improve facial attractiveness and is well accepted by the public. However, different makeup styles will result in significant facial appearance changes. It remains a challenging problem to match makeup and non-makeup face images. This paper proposes a learning from generation approach for makeup-invariant face verification by introducing a bi-level adversarial network (BLAN). To alleviate the negative effects from makeup, we first generate non-makeup images from makeup ones, and then use the synthesized non-makeup images for further verification. Two adversarial networks in BLAN are integrated in an end-to-end deep network, with the one on pixel level for reconstructing appealing facial images and the other on feature level for preserving identity information. These two networks jointly reduce the sensing gap between makeup and non-makeup images. Moreover, we make the generator well constrained by incorporating multiple perceptual losses. Experimental results on three benchmark makeup face datasets demonstrate that our method achieves state-of-the-art verification accuracy across makeup status and can produce photo-realistic non-makeup face images.

CVAug 8, 2017
Wasserstein CNN: Learning Invariant Features for NIR-VIS Face Recognition

Ran He, Xiang Wu, Zhenan Sun et al.

Heterogeneous face recognition (HFR) aims to match facial images acquired from different sensing modalities with mission-critical applications in forensics, security and commercial sectors. However, HFR is a much more challenging problem than traditional face recognition because of large intra-class variations of heterogeneous face images and limited training samples of cross-modality face image pairs. This paper proposes a novel approach namely Wasserstein CNN (convolutional neural networks, or WCNN for short) to learn invariant features between near-infrared and visual face images (i.e. NIR-VIS face recognition). The low-level layers of WCNN are trained with widely available face images in visual spectrum. The high-level layer is divided into three parts, i.e., NIR layer, VIS layer and NIR-VIS shared layer. The first two layers aims to learn modality-specific features and NIR-VIS shared layer is designed to learn modality-invariant feature subspace. Wasserstein distance is introduced into NIR-VIS shared layer to measure the dissimilarity between heterogeneous feature distributions. So W-CNN learning aims to achieve the minimization of Wasserstein distance between NIR distribution and VIS distribution for invariant deep feature representation of heterogeneous face images. To avoid the over-fitting problem on small-scale heterogeneous face data, a correlation prior is introduced on the fully-connected layers of WCNN network to reduce parameter space. This prior is implemented by a low-rank constraint in an end-to-end network. The joint formulation leads to an alternating minimization for deep feature representation at training stage and an efficient computation for heterogeneous data at testing stage. Extensive experiments on three challenging NIR-VIS face recognition databases demonstrate the significant superiority of Wasserstein CNN over state-of-the-art methods.

CVApr 12, 2017
Attention-Set based Metric Learning for Video Face Recognition

Yibo Hu, Xiang Wu, Ran He

Face recognition has made great progress with the development of deep learning. However, video face recognition (VFR) is still an ongoing task due to various illumination, low-resolution, pose variations and motion blur. Most existing CNN-based VFR methods only obtain a feature vector from a single image and simply aggregate the features in a video, which less consider the correlations of face images in one video. In this paper, we propose a novel Attention-Set based Metric Learning (ASML) method to measure the statistical characteristics of image sets. It is a promising and generalized extension of Maximum Mean Discrepancy with memory attention weighting. First, we define an effective distance metric on image sets, which explicitly minimizes the intra-set distance and maximizes the inter-set distance simultaneously. Second, inspired by Neural Turing Machine, a Memory Attention Weighting is proposed to adapt set-aware global contents. Then ASML is naturally integrated into CNNs, resulting in an end-to-end learning scheme. Our method achieves state-of-the-art performance for the task of video face recognition on the three widely used benchmarks including YouTubeFace, YouTube Celebrities and Celebrity-1000.

CVApr 8, 2017
Coupled Deep Learning for Heterogeneous Face Recognition

Xiang Wu, Lingxiao Song, Ran He et al.

Heterogeneous face matching is a challenge issue in face recognition due to large domain difference as well as insufficient pairwise images in different modalities during training. This paper proposes a coupled deep learning (CDL) approach for the heterogeneous face matching. CDL seeks a shared feature space in which the heterogeneous face matching problem can be approximately treated as a homogeneous face matching problem. The objective function of CDL mainly includes two parts. The first part contains a trace norm and a block-diagonal prior as relevance constraints, which not only make unpaired images from multiple modalities be clustered and correlated, but also regularize the parameters to alleviate overfitting. An approximate variational formulation is introduced to deal with the difficulties of optimizing low-rank constraint directly. The second part contains a cross modal ranking among triplet domain specific images to maximize the margin for different identities and increase data for a small amount of training samples. Besides, an alternating minimization method is employed to iteratively update the parameters of CDL. Experimental results show that CDL achieves better performance on the challenging CASIA NIR-VIS 2.0 face recognition database, the IIIT-D Sketch database, the CUHK Face Sketch (CUFS), and the CUHK Face Sketch FERET (CUFSF), which significantly outperforms state-of-the-art heterogeneous face recognition methods.

CVNov 12, 2016
Learning Scene-specific Object Detectors Based on a Generative-Discriminative Model with Minimal Supervision

Dapeng Luo, Zhipeng Zeng, Nong Sang et al.

One object class may show large variations due to diverse illuminations, backgrounds and camera viewpoints. Traditional object detection methods often perform worse under unconstrained video environments. To address this problem, many modern approaches model deep hierarchical appearance representations for object detection. Most of these methods require a timeconsuming training process on large manual labelling sample set. In this paper, the proposed framework takes a remarkably different direction to resolve the multi-scene detection problem in a bottom-up fashion. First, a scene-specific objector is obtained from a fully autonomous learning process triggered by marking several bounding boxes around the object in the first video frame via a mouse. Here the human labeled training data or a generic detector are not needed. Second, this learning process is conveniently replicated many times in different surveillance scenes and results in particular detectors under various camera viewpoints. Thus, the proposed framework can be employed in multi-scene object detection applications with minimal supervision. Obviously, the initial scene-specific detector, initialized by several bounding boxes, exhibits poor detection performance and is difficult to improve with traditional online learning algorithm. Consequently, we propose Generative-Discriminative model to partition detection response space and assign each partition an individual descriptor that progressively achieves high classification accuracy. A novel online gradual optimized process is proposed to optimize the Generative-Discriminative model and focus on the hard samples.Experimental results on six video datasets show our approach achieves comparable performance to robust supervised methods, and outperforms the state of the art self-learning methods under varying imaging conditions.

CVJul 17, 2015
Learning Robust Deep Face Representation

Xiang Wu

With the development of convolution neural network, more and more researchers focus their attention on the advantage of CNN for face recognition task. In this paper, we propose a deep convolution network for learning a robust face representation. The deep convolution net is constructed by 4 convolution layers, 4 max pooling layers and 2 fully connected layers, which totally contains about 4M parameters. The Max-Feature-Map activation function is used instead of ReLU because the ReLU might lead to the loss of information due to the sparsity while the Max-Feature-Map can get the compact and discriminative feature vectors. The model is trained on CASIA-WebFace dataset and evaluated on LFW dataset. The result on LFW achieves 97.77% on unsupervised setting for single net.