Yao Xie

h-index: 28

56 papers

1,205 citations

Novelty: 51%

AI Score: 35

Ranked #107,980 of 194,257 authors (top 56%); #1,569 in ML (top 47%)

56 Papers

4.3 AP Nov 4, 2023

Mobile Internet Quality Estimation using Self-Tuning Kernel Regression

Hanyang Jiang, Henry Shaowu Yuchi, Elizabeth Belding et al.

Modeling and estimation for spatial data are ubiquitous in real life, frequently appearing in weather forecasting, pollution detection, and agriculture. Spatial data analysis often involves processing datasets of enormous scale. In this work, we focus on large-scale internet-quality open datasets from Ookla. We look into estimating mobile (cellular) internet quality at the scale of a state in the United States. In particular, we aim to conduct estimation based on highly imbalanced data: most of the samples are concentrated in limited areas, while very few are available in the rest, posing significant challenges to modeling efforts. We propose a new adaptive kernel regression approach that employs self-tuning kernels to alleviate the adverse effects of data imbalance in this problem. Through comparative experimentation on two distinct mobile network measurement datasets, we demonstrate that the proposed self-tuning kernel regression method produces more accurate predictions, with the potential to be applied in other applications.
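To make the idea of a self-tuning kernel concrete, here is a minimal sketch of kernel regression with locally scaled bandwidths. It assumes one common notion of self-tuning (each point's bandwidth is its distance to its k-th nearest neighbor); the abstract does not specify the kernel, so the paper's construction may differ. The names `kth_neighbor_distance` and `self_tuning_predict` are hypothetical.

```python
import math

def kth_neighbor_distance(x, points, k):
    # Distance from x to its k-th nearest point in `points`.
    # Zero distances are skipped so a training point is not its own neighbor.
    dists = sorted(d for d in (math.dist(x, p) for p in points) if d > 0.0)
    return dists[min(k, len(dists)) - 1]

def self_tuning_predict(x, X, y, k=2):
    # Nadaraya-Watson estimate with a locally scaled Gaussian-type kernel:
    # w_i = exp(-||x - x_i||^2 / (sigma_x * sigma_i)), an assumed form.
    sigma_x = kth_neighbor_distance(x, X, k)
    num = den = 0.0
    for xi, yi in zip(X, y):
        sigma_i = kth_neighbor_distance(xi, X, k)
        w = math.exp(-math.dist(x, xi) ** 2 / (sigma_x * sigma_i))
        num += w * yi
        den += w
    return num / den

# Imbalanced toy data: three samples clustered together, two isolated ones.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (9.0, 1.0)]
y = [1.0, 1.0, 1.0, 10.0, 10.0]
estimate = self_tuning_predict((0.05, 0.05), X, y)
```

Because the bandwidth shrinks in dense regions and grows in sparse ones, isolated samples still contribute to queries far from the dense cluster, which is the property that matters under imbalance.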

7.4 ML Jun 20, 2023

Deep graph kernel point processes

Zheng Dong, Matthew Repasky, Xiuyuan Cheng et al.

Point process models are widely used for continuous asynchronous event data, where each data point includes time and additional information called "marks", which can be locations, nodes, or event types. This paper presents a novel point process model for discrete event data over graphs, where the event interaction occurs within a latent graph structure. Our model builds upon Hawkes's classic influence kernel-based formulation in the original self-exciting point processes work to capture the influence of historical events on future events' occurrence. The key idea is to represent the influence kernel by Graph Neural Networks (GNN) to capture the underlying graph structure while harvesting the strong representation power of GNNs. Compared with prior works focusing on directly modeling the conditional intensity function using neural networks, our kernel representation captures the repeated event influence patterns more effectively by combining statistical and deep models, achieving better model estimation/learning efficiency and superior predictive performance. Our work significantly extends the existing deep spatio-temporal kernel for point process data, which is inapplicable to our setting due to the fundamental difference in the nature of the observation space being Euclidean rather than a graph. We present comprehensive experiments on synthetic and real-world data to show the superior performance of the proposed approach against the state-of-the-art in predicting future events and uncovering the relational structure among data.
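As a point of reference for the influence-kernel formulation mentioned above, here is a minimal sketch of a classical self-exciting conditional intensity with a pluggable kernel. The exponential kernel and the parameter values are illustrative assumptions; the paper replaces this kind of parametric kernel with a GNN-based one, which is not reproduced here.

```python
import math

def exponential_kernel(t, ti, alpha=0.8, beta=1.0):
    # Classical parametric influence of a past event at time ti on time t.
    return alpha * beta * math.exp(-beta * (t - ti))

def conditional_intensity(t, history, mu=0.5, kernel=exponential_kernel):
    # lambda(t) = mu + sum of kernel(t, t_i) over past events t_i < t.
    return mu + sum(kernel(t, ti) for ti in history if ti < t)

# Intensity one time unit after a single event at time 0.
lam = conditional_intensity(1.0, [0.0])
```

Passing a different `kernel` callable is the hook where a learned, non-parametric influence function would plug in.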

1.2 SP Jan 8, 2019

Distributed Change Detection via Average Consensus over Networks

Qinghua Liu, Rui Zhang, Yao Xie

Distributed change-point detection is a fundamental problem in real-time monitoring with sensor networks. We propose a distributed detection algorithm in which each sensor exchanges only its CUSUM statistic with its neighbors, based on the average consensus scheme, and an alarm is raised when a local consensus statistic exceeds a pre-specified global threshold. We provide theoretical performance bounds showing that the performance of the fully distributed scheme can match that of centralized algorithms under some mild conditions. Numerical experiments demonstrate the good performance of the algorithm, especially in detecting asynchronous changes.
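The scheme described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's algorithm: the Gaussian mean-shift likelihood ratio, the specific consensus recursion (inject each sensor's CUSUM increment, then average with neighbors), and the names `cusum_step`, `consensus_step`, and `detect_change` are all hypothetical choices.

```python
def cusum_step(prev, x, mu=1.0):
    # Local CUSUM recursion for a mean shift 0 -> mu in unit-variance
    # Gaussian noise; the log-likelihood ratio is mu * x - mu^2 / 2.
    return max(0.0, prev + mu * x - 0.5 * mu * mu)

def consensus_step(values, weights):
    # One round of averaging: sensor i mixes values using row i of `weights`
    # (nonzero entries correspond to neighbors in the network).
    n = len(values)
    return [sum(weights[i][j] * values[j] for j in range(n)) for i in range(n)]

def detect_change(observations, weights, threshold):
    n = len(weights)
    local = [0.0] * n   # local CUSUM statistics
    shared = [0.0] * n  # consensus statistics exchanged with neighbors
    for t, xs in enumerate(observations):
        new_local = [cusum_step(local[i], xs[i]) for i in range(n)]
        # Add each sensor's CUSUM increment, then average over neighbors.
        mixed = [shared[i] + new_local[i] - local[i] for i in range(n)]
        shared = consensus_step(mixed, weights)
        local = new_local
        if max(shared) > threshold:
            return t    # first time any sensor raises an alarm
    return None         # no alarm raised

# Three fully connected sensors with a doubly stochastic weight matrix.
W = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
obs = [[0.0] * 3] * 5 + [[2.0] * 3] * 5  # mean shifts at time index 5
alarm_time = detect_change(obs, W, threshold=4.0)
```

With doubly stochastic weights, the sum of the consensus statistics always equals the sum of the local CUSUM statistics, so each sensor tracks a network-wide average while communicating only with neighbors.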

4.6 LG Aug 19, 2024

Regularization for Adversarial Robust Learning

Jie Wang, Rui Gao, Yao Xie

Despite the growing prevalence of artificial neural networks in real-world applications, their vulnerability to adversarial attacks remains a significant concern, which motivates us to investigate the robustness of machine learning models. While various heuristics aim to optimize the distributionally robust risk using the $\infty$-Wasserstein metric, such a notion of robustness frequently encounters computational intractability. To tackle the computational challenge, we develop a novel approach to adversarial training that integrates $φ$-divergence regularization into the distributionally robust risk function. This regularization brings a notable improvement in computation compared with the original formulation. We develop stochastic gradient methods with biased oracles to solve this problem efficiently, achieving near-optimal sample complexity. Moreover, we establish its regularization effects and demonstrate its asymptotic equivalence to a regularized empirical risk minimization framework by considering various scaling regimes of the regularization parameter and robustness level. These regimes yield gradient norm regularization, variance regularization, or a smoothed gradient norm regularization that interpolates between these extremes. We numerically validate our proposed method in supervised learning, reinforcement learning, and contextual learning and showcase its state-of-the-art performance against various adversarial attacks.

7.8 ML Apr 8, 2025

Deep spatio-temporal point processes: Advances and new directions

Xiuyuan Cheng, Zheng Dong, Yao Xie

Spatio-temporal point processes (STPPs) model discrete events distributed in time and space, with important applications in areas such as criminology, seismology, epidemiology, and social networks. Traditional models often rely on parametric kernels, limiting their ability to capture heterogeneous, nonstationary dynamics. Recent innovations integrate deep neural architectures, either by modeling the conditional intensity function directly or by learning flexible, data-driven influence kernels, substantially broadening their expressive power. This article reviews the development of the deep influence kernel approach, which enjoys statistical explainability, since the influence kernel remains in the model to capture the spatiotemporal propagation of event influence and its impact on future events, while also possessing strong expressive power, thereby benefiting from both worlds. We explain the main components in developing deep kernel point processes, leveraging tools such as functional basis decomposition and graph neural networks to encode complex spatial or network structures, as well as estimation using both likelihood-based and likelihood-free methods, and address computational scalability for large-scale data. We also discuss the theoretical foundation of kernel identifiability. Simulated and real-data examples highlight applications to crime analysis, earthquake aftershock prediction, and sepsis prediction modeling, and we conclude by discussing promising directions for the field.

12.0 ML Dec 29, 2024

Distributionally Robust Optimization via Iterative Algorithms in Continuous Probability Spaces

Linglingzhi Zhu, Yao Xie

We consider a minimax problem motivated by distributionally robust optimization (DRO) when the worst-case distribution is continuous, leading to significant computational challenges due to the infinite-dimensional nature of the optimization problem. Recent research has explored learning the worst-case distribution using neural network-based generative models to address these computational challenges but lacks algorithmic convergence guarantees. This paper bridges this theoretical gap by presenting an iterative algorithm to solve such a minimax problem, achieving global convergence under mild assumptions and leveraging technical tools from vector space minimax optimization and convex analysis in the space of continuous probability densities. In particular, leveraging Brenier's theorem, we represent the worst-case distribution as a transport map applied to a continuous reference measure and reformulate the regularized discrepancy-based DRO as a minimax problem in the Wasserstein space. Furthermore, we demonstrate that the worst-case distribution can be efficiently computed using a modified Jordan-Kinderlehrer-Otto (JKO) scheme with sufficiently large regularization parameters for commonly used discrepancy functions, linked to the radius of the ambiguity set. Additionally, we derive the global convergence rate and quantify the total number of subgradient and inexact modified JKO iterations required to obtain approximate stationary points. These results are potentially applicable to nonconvex and nonsmooth scenarios, with broad relevance to modern machine learning applications.

9.4 LG Mar 18, 2025 Code

Sepsyn-OLCP: An Online Learning-based Framework for Early Sepsis Prediction with Uncertainty Quantification using Conformal Prediction

Anni Zhou, Beyah Raheem, Rishikesan Kamaleswaran et al.

Sepsis is a life-threatening syndrome with high morbidity and mortality in hospitals. Early prediction of sepsis plays a crucial role in facilitating early interventions for septic patients. However, early sepsis prediction systems with uncertainty quantification and adaptive learning are scarce. This paper proposes Sepsyn-OLCP, a novel online learning algorithm for early sepsis prediction by integrating conformal prediction for uncertainty quantification and Bayesian bandits for adaptive decision-making. By combining the robustness of Bayesian models with the statistical uncertainty guarantees of conformal prediction methodologies, this algorithm delivers accurate and trustworthy predictions, addressing the critical need for reliable and adaptive systems in high-stakes healthcare applications such as early sepsis prediction. We evaluate the performance of Sepsyn-OLCP in terms of regret in the stochastic bandit setting, the area under the receiver operating characteristic curve (AUROC), and F-measure. Our results show that Sepsyn-OLCP outperforms existing individual models, increasing the AUROC of a neural network from 0.64 to 0.73 without retraining or high computational cost. The model selection policy also converges to the optimal strategy in the long run. We propose a novel reinforcement learning-based framework integrated with conformal prediction techniques to provide uncertainty quantification for early sepsis prediction. The proposed methodology delivers accurate and trustworthy predictions, addressing a critical need in high-stakes healthcare applications like early sepsis prediction.

11.4 LG Feb 13, 2025

Neural Spatiotemporal Point Processes: Trends and Challenges

Sumantrak Mukherjee, Mouad Elhamdi, George Mohler et al.

Spatiotemporal point processes (STPPs) are probabilistic models for events occurring in continuous space and time. Real-world event data often exhibit intricate dependencies and heterogeneous dynamics. By incorporating modern deep learning techniques, STPPs can model these complexities more effectively than traditional approaches. Consequently, the fusion of neural methods with STPPs has become an active and rapidly evolving research area. In this review, we categorize existing approaches, unify key design choices, and explain the challenges of working with this data modality. We further highlight emerging trends and diverse application domains. Finally, we identify open challenges and gaps in the literature.

3.1 ML Nov 5, 2024

Point processes with event time uncertainty

Xiuyuan Cheng, Tingnan Gong, Yao Xie

Point processes are widely used statistical models for uncovering the temporal patterns in dependent event data. In many applications, the event time cannot be observed exactly, calling for the incorporation of time uncertainty into the modeling of point process data. In this work, we introduce a framework to model time-uncertain point processes, possibly on a network. We start by deriving the formulation in the continuous-time setting under a few assumptions motivated by application scenarios. After imposing a time grid, we obtain a discrete-time model that facilitates inference and can be computed by first-order optimization methods such as gradient descent or variational inequality (VI) using batch-based stochastic gradient descent (SGD). The parameter recovery guarantee is proved for VI inference at an $O(1/k)$ convergence rate using $k$ SGD steps. Our framework handles non-stationary processes by modeling the influence kernel as a matrix (or tensor on a network), and it covers stationary processes, such as the classical Hawkes process, as a special case. We experimentally show that the proposed approach outperforms previous General Linear Model (GLM) baselines on simulated and real data and reveals meaningful causal relations on a Sepsis-associated Derangements dataset.

1.2 ME Apr 12, 2025

Graph-Based Prediction Models for Data Debiasing

Dongze Wu, Hanyang Jiang, Yao Xie

Bias in data collection, arising from both under-reporting and over-reporting, poses significant challenges in critical applications such as healthcare and public safety. In this work, we introduce Graph-based Over- and Under-reporting Debiasing (GROUD), a novel graph-based optimization framework that debiases reported data by jointly estimating the true incident counts and the associated reporting bias probabilities. By modeling the bias as a smooth signal over a graph constructed from geophysical or feature-based similarities, our convex formulation not only ensures a unique solution but also comes with theoretical recovery guarantees under certain assumptions. We validate GROUD on both challenging simulated experiments and real-world datasets -- including Atlanta emergency calls and COVID-19 vaccine adverse event reports -- demonstrating its robustness and superior performance in accurately recovering debiased counts. This approach paves the way for more reliable downstream decision-making in systems affected by reporting irregularities.

2.6 LG Jun 11, 2024

Nonlinear time-series embedding by monotone variational inequality

Jonathan Y. Zhou, Yao Xie

In the wild, we often encounter collections of sequential data such as electrocardiograms, motion capture, genomes, and natural language, and sequences may be multichannel or symbolic with nonlinear dynamics. We introduce a new method that learns low-dimensional representations of nonlinear time series without supervision and comes with provable recovery guarantees. The learned representation can be used for downstream machine-learning tasks such as clustering and classification. The method is based on the assumption that the observed sequences arise from a common domain, but each sequence obeys its own autoregressive model, and these models are related to each other through low-rank regularization. We cast the problem as a computationally efficient convex matrix parameter recovery problem using monotone variational inequalities and encode the common domain assumption via a low-rank constraint across the learned representations, which can learn the geometry of the entire domain as well as faithful representations of the dynamics of each individual sequence using the domain information in totality. We show the competitive performance of our method against baselines on real-world time-series data and demonstrate its effectiveness for symbolic text modeling and RNA sequence clustering.

20.3 OC Sep 24, 2021 Code

Sinkhorn Distributionally Robust Optimization

Jie Wang, Rui Gao, Yao Xie

We study distributionally robust optimization with Sinkhorn distance -- a variant of Wasserstein distance based on entropic regularization. We derive a convex programming dual reformulation for general nominal distributions, transport costs, and loss functions. To solve the dual reformulation, we develop a stochastic mirror descent algorithm with biased subgradient estimators and derive its computational complexity guarantees. Finally, we provide numerical examples using synthetic and real data to demonstrate its superior performance.
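For readers unfamiliar with the Sinkhorn distance itself, here is a minimal sketch of entropic-regularized optimal transport between two discrete distributions, computed with the standard Sinkhorn scaling iterations. It illustrates only the distance; the paper's dual reformulation and stochastic mirror descent algorithm are not reproduced. The function name `sinkhorn`, the value of `eps`, and the toy inputs are assumptions.

```python
import math

def sinkhorn(a, b, cost, eps=0.1, iters=500):
    # a, b: probability vectors; cost[i][j]: cost of moving mass from i to j.
    # eps is the entropic regularization strength.
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    transport_cost = sum(plan[i][j] * cost[i][j]
                         for i in range(n) for j in range(m))
    return plan, transport_cost

# Two identical two-point distributions with a 0/1 cost.
plan, value = sinkhorn([0.5, 0.5], [0.5, 0.5], [[0.0, 1.0], [1.0, 0.0]])
```

Smaller `eps` brings the result closer to the unregularized Wasserstein cost but makes the kernel matrix `K` more poorly scaled, so practical implementations usually work in the log domain.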

7.5 LG Sep 8, 2021

Class-conditioned Domain Generalization via Wasserstein Distributional Robust Optimization

Jingge Wang, Yang Li, Liyan Xie et al.

Given multiple source domains, domain generalization aims at learning a universal model that performs well on any unseen but related target domain. In this work, we focus on the domain generalization scenario where domain shifts occur among class-conditional distributions of different domains. Existing approaches are not sufficiently robust when the variation of conditional distributions given the same class is large. We extend the concept of distributionally robust optimization to solve the class-conditional domain generalization problem. Our approach optimizes the worst-case performance of a classifier over class-conditional distributions within a Wasserstein ball centered around the barycenter of the source conditional distributions. We also propose an iterative algorithm for learning the optimal radius of the Wasserstein balls automatically. Experiments show that the proposed framework performs better on unseen target domains than approaches without domain generalization.

9.9 LG Jun 20, 2021 Code

Neural Spectral Marked Point Processes

Shixiang Zhu, Haoyun Wang, Zheng Dong et al.

Self- and mutually-exciting point processes are popular models in machine learning and statistics for dependent discrete event data. To date, most existing models assume stationary kernels (including the classical Hawkes processes) and simple parametric models. Modern applications with complex event data require more general point process models that can incorporate contextual information of the events, called marks, besides the temporal and location information. Moreover, such applications often require non-stationary models to capture more complex spatio-temporal dependence. To tackle these challenges, a key question is how to devise a versatile influence kernel in the point process model. In this paper, we introduce a novel and general neural network-based non-stationary influence kernel with high expressiveness for handling complex discrete event data while providing theoretical performance guarantees. We demonstrate the superior performance of our proposed method compared with the state-of-the-art on synthetic and real data.

14.9 ML Jun 6, 2021 Code

Neural Tangent Kernel Maximum Mean Discrepancy

Xiuyuan Cheng, Yao Xie

We present a novel neural network Maximum Mean Discrepancy (MMD) statistic by identifying a new connection between neural tangent kernel (NTK) and MMD. This connection enables us to develop a computationally efficient and memory-efficient approach to compute the MMD statistic and perform NTK-based two-sample tests, addressing the long-standing challenge of the memory and computational complexity of the MMD statistic, which is essential for online implementations that assimilate new samples. Theoretically, such a connection allows us to understand the NTK test statistic properties, such as the Type-I error and testing power for performing the two-sample test, by adapting existing theories for kernel MMD. Numerical experiments on synthetic and real-world datasets validate the theory and demonstrate the effectiveness of the proposed NTK-MMD statistic.
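For context, here is a minimal sketch of the standard kernel MMD statistic that the NTK-based version builds on. It uses a Gaussian kernel as a stand-in, which is an assumption made for illustration: the paper's statistic uses the neural tangent kernel, which is not implemented here. The quadratic cost in the sample size visible in `mean_k` is the complexity issue the abstract refers to.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    # Stand-in kernel (assumption); the paper uses the neural tangent kernel.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(X, Y, kernel=gaussian_kernel):
    # Biased (V-statistic) estimate:
    # mean k(x, x') + mean k(y, y') - 2 * mean k(x, y).
    def mean_k(A, B):
        return sum(kernel(a, b) for a in A for b in B) / (len(A) * len(B))
    return mean_k(X, X) + mean_k(Y, Y) - 2.0 * mean_k(X, Y)

# Two small one-dimensional samples, the second shifted by 3.
X = [(0.0,), (1.0,)]
Y = [(3.0,), (4.0,)]
statistic = mmd_squared(X, Y)
```

A two-sample test would compare `statistic` against a threshold calibrated, for example, by permuting the pooled samples.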

6.3 ML May 31, 2021

Early Detection of COVID-19 Hotspots Using Spatio-Temporal Data

Shixiang Zhu, Alexander Bukharin, Liyan Xie et al.

Recently, the Centers for Disease Control and Prevention (CDC) has worked with other federal agencies to identify counties with increasing coronavirus disease 2019 (COVID-19) incidence (hotspots) and offers support to local health departments to limit the spread of the disease. Understanding the spatio-temporal dynamics of hotspot events is of great importance to support policy decisions and prevent large-scale outbreaks. This paper presents a spatio-temporal Bayesian framework for early detection of COVID-19 hotspots (at the county level) in the United States. We assume both the observed number of cases and hotspots depend on a class of latent random variables, which encode the underlying spatio-temporal dynamics of the transmission of COVID-19. Such latent variables follow a zero-mean Gaussian process, whose covariance is specified by a non-stationary kernel function. The most salient feature of our kernel function is that deep neural networks are introduced to enhance the model's representative power while still enjoying the interpretability of the kernel. We derive a sparse model and fit the model using a variational learning strategy to circumvent the computational intractability for large data sets. Our model demonstrates better interpretability and superior hotspot-detection performance compared to other baseline methods.

8.6 AP May 25, 2021

Conformal Anomaly Detection on Spatio-Temporal Observations with Missing Data

Chen Xu, Yao Xie

We develop a distribution-free, unsupervised anomaly detection method called ECAD, which wraps around any regression algorithm and sequentially detects anomalies. Rooted in conformal prediction, ECAD does not require data exchangeability but approximately controls the Type-I error when data are normal. Computationally, it involves no data-splitting and efficiently trains ensemble predictors to increase statistical power. We demonstrate the superior performance of ECAD on detecting anomalous spatio-temporal traffic flow.

1.4 CV Apr 18, 2021

Signal Processing Challenges and Examples for in-situ Transmission Electron Microscopy

Josh Kacher, Yao Xie, Sven P. Voigt et al.

Transmission Electron Microscopy (TEM) is a powerful tool for imaging material structure and characterizing material chemistry. Recent advances in data collection technology for TEM have enabled high-volume and high-resolution data collection at a microsecond frame rate. Taking advantage of these advances in data collection rates requires the development and application of data processing tools, including image analysis, feature extraction, and streaming data processing techniques. In this paper, we highlight a few areas in materials science that have benefited from combining signal processing and statistical analysis with data collection capabilities in TEM and present a future outlook on opportunities of integrating signal processing with automated TEM data analysis.

6.3 ML Feb 10, 2021

Sequential change-point detection for mutually exciting point processes over networks

Haoyun Wang, Liyan Xie, Yao Xie et al.

We present a new CUSUM procedure for sequentially detecting change-points in self- and mutually-exciting processes, a.k.a. Hawkes networks, using discrete event data. Hawkes networks have become a popular model in statistics and machine learning due to their capability in modeling irregularly observed data where the timing between events carries a lot of information. The problem of detecting abrupt changes in Hawkes networks arises in various applications, including neuronal imaging, sensor network, and social network monitoring. Despite this, there has not been a computationally and memory-efficient online algorithm for detecting such changes from sequential data. We present an efficient online recursive implementation of the CUSUM statistic for Hawkes processes, both decentralized and memory-efficient, and establish the theoretical properties of this new CUSUM procedure. We then show that the proposed CUSUM method achieves better performance than existing methods, including the Shewhart procedure based on count data, the generalized likelihood ratio (GLR) in the existing literature, and the standard score statistic. We demonstrate this via a simulated example and an application to population code change-detection in neuronal networks.

2.7 ML Nov 24, 2020

Tensor Kernel Recovery for Spatio-Temporal Hawkes Processes

Heejune Sheen, Xiaonan Zhu, Yao Xie

We estimate the general influence functions for spatio-temporal Hawkes processes using a tensor recovery approach, formulating the location-dependent influence function that captures the influence of historical events as a tensor kernel. We assume a low-rank structure for the tensor kernel and cast the estimation problem as a convex optimization problem using the Fourier transformed nuclear norm (TNN). We provide theoretical performance guarantees for our approach and present an algorithm to solve the optimization problem. Moreover, we demonstrate the efficiency of our estimation with numerical simulations.

12.5 ML Oct 22, 2020

Two-sample Test using Projected Wasserstein Distance

Jie Wang, Rui Gao, Yao Xie

We develop a projected Wasserstein distance for the two-sample test, a fundamental problem in statistics and machine learning: given two sets of samples, determine whether they are from the same distribution. In particular, we aim to circumvent the curse of dimensionality in Wasserstein distance: when the dimension is high, it has diminishing testing power, which is inherently due to the slow concentration property of Wasserstein metrics in high-dimensional spaces. A key contribution is to couple the test with an optimal projection, finding the low-dimensional linear mapping that maximizes the Wasserstein distance between the projected probability distributions. We characterize the finite-sample convergence rate of integral probability metrics (IPMs) and present practical algorithms for computing this metric. Numerical examples validate our theoretical results.
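The projection idea can be illustrated with a minimal sketch for two-dimensional samples projected onto a line. The grid search over directions is a simple stand-in chosen for illustration, not the paper's algorithm, and the names `wasserstein_1d` and `projected_wasserstein_2d` are hypothetical.

```python
import math

def wasserstein_1d(xs, ys):
    # W1 between two equal-size one-dimensional empirical distributions:
    # the optimal coupling matches sorted samples.
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def projected_wasserstein_2d(X, Y, n_angles=180):
    # Search unit directions (cos t, sin t) for the projection that
    # maximizes the one-dimensional Wasserstein distance.
    best = 0.0
    for k in range(n_angles):
        theta = math.pi * k / n_angles
        c, s = math.cos(theta), math.sin(theta)
        proj_x = [c * x0 + s * x1 for x0, x1 in X]
        proj_y = [c * y0 + s * y1 for y0, y1 in Y]
        best = max(best, wasserstein_1d(proj_x, proj_y))
    return best

# Y is X shifted by 2 along the first coordinate.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
Y = [(2.0, 0.0), (3.0, 0.0), (2.0, 1.0), (3.0, 1.0)]
distance = projected_wasserstein_2d(X, Y)
```

In higher dimensions a grid over directions is infeasible, which is why the projection has to be found by optimization instead.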

5.1 ST Jun 16, 2020

Goodness-of-Fit Test for Mismatched Self-Exciting Processes

Song Wei, Shixiang Zhu, Minghe Zhang et al.

Recently there have been many research efforts in developing generative models for self-exciting point processes, partly due to their broad applicability to real-world problems. However, we can rarely quantify how well the generative model captures the ground truth, since it is usually unknown. The challenge lies in the fact that generative models typically provide, at most, good approximations to the ground truth (e.g., through the rich representative power of neural networks), but they cannot be precisely the ground truth. We thus cannot use the classic goodness-of-fit (GOF) test framework to evaluate their performance. In this paper, we develop a GOF test for generative models of self-exciting processes by making a new connection between this problem and the classical statistical theory of the quasi-maximum-likelihood estimator (QMLE). We present a non-parametric self-normalizing statistic for the GOF test, the Generalized Score (GS) statistic, and explicitly capture the model misspecification when establishing the asymptotic distribution of the GS statistic. Numerical simulation and real-data experiments validate our theory and demonstrate the proposed GS test's good performance.

10.3 ML Jun 12, 2020

Uncertainty Quantification for Inferring Hawkes Networks

Haoyun Wang, Liyan Xie, Alex Cuozzo et al.

Multivariate Hawkes processes are commonly used to model streaming networked event data in a wide variety of applications. However, it remains a challenge to extract reliable inference from complex datasets with uncertainty quantification. Aiming towards this, we develop a statistical inference framework to learn causal relationships between nodes from networked data, where the underlying directed graph implies Granger causality. We provide uncertainty quantification for the maximum likelihood estimate of the network multivariate Hawkes process by providing a non-asymptotic confidence set. The main technique is based on the concentration inequalities of continuous-time martingales. We compare our method to the previously-derived asymptotic Hawkes process confidence interval, and demonstrate the strengths of our method in an application to neuronal connectivity reconstruction.

5.8 ML Jun 7, 2020

Distributionally Robust Weighted $k$-Nearest Neighbors

Shixiang Zhu, Liyan Xie, Minghe Zhang et al.

Learning a robust classifier from a few samples remains a key challenge in machine learning. A major thrust of research has been focused on developing $k$-nearest neighbor ($k$-NN) based algorithms combined with metric learning that captures similarities between samples. When the samples are limited, robustness is especially crucial to ensure the generalization capability of the classifier. In this paper, we study a minimax distributionally robust formulation of weighted $k$-nearest neighbors, which aims to find the optimal weighted $k$-NN classifiers that hedge against feature uncertainties. We develop an algorithm, Dr.k-NN, that efficiently solves this functional optimization problem and assigns minimax optimal weights to training samples when performing classification. These weights are class-dependent and are determined by the similarities of sample features under the least favorable scenarios. When the size of the uncertainty set is properly tuned, the robust classifier has a smaller Lipschitz norm than the vanilla $k$-NN and thus improves the generalization capability. We also couple our framework with neural-network-based feature embedding. We demonstrate the competitive performance of our algorithm compared to the state-of-the-art in the few-training-sample setting with various real-data experiments.

9.0 LG May 15, 2020

Spatio-Temporal Point Processes with Attention for Traffic Congestion Event Modeling

Shixiang Zhu, Ruyi Ding, Minghe Zhang et al.

We present a novel framework for modeling traffic congestion events over road networks. Using multi-modal data that combine count data from traffic sensors with police reports of traffic incidents, we aim to capture two types of triggering effects for congestion events. Current traffic congestion at one location may cause future congestion over the road network, and traffic incidents may cause congestion to spread. To model the non-homogeneous temporal dependence of the event on the past, we use a novel attention-based mechanism based on neural network embedding for point processes. To incorporate the directional spatial dependence induced by the road network, we adapt the "tail-up" model from the context of spatial statistics to the traffic network setting. We demonstrate our approach's superior performance compared to the state-of-the-art methods for both synthetic and real data.

11.3 ST Mar 29, 2020

Convex Parameter Recovery for Interacting Marked Processes

Anatoli Juditsky, Arkadi Nemirovski, Liyan Xie et al.

We introduce a new general modeling approach for multivariate discrete event data with categorical interacting marks, which we refer to as marked Bernoulli processes. In the proposed model, the probability that an event of a specific category occurs at a location may be influenced by past events at this and other locations. We do not restrict interactions to be positive or decaying over time, as is commonly assumed, allowing us to capture an arbitrary shape of influence from historical events, locations, and events of different categories. In our modeling, prior knowledge is incorporated by allowing general convex constraints on model parameters. We develop two parameter estimation procedures utilizing constrained Least Squares (LS) and Maximum Likelihood (ML) estimation, which are solved using variational inequalities with monotone operators. We discuss different applications of our approach and illustrate the performance of the proposed recovery routines on synthetic examples and a real-world police dataset.

15.1 ML Feb 17, 2020

Deep Fourier Kernel for Self-Attentive Point Processes

Shixiang Zhu, Minghe Zhang, Ruyi Ding et al.

We present a novel attention-based model for discrete event data to capture complex non-linear temporal dependence structures. We borrow the idea from the attention mechanism and incorporate it into the point processes' conditional intensity function. We further introduce a novel score function using Fourier kernel embedding, whose spectrum is represented using neural networks, which drastically differs from the traditional dot-product kernel and can capture a more complex similarity structure. We establish our approach's theoretical properties and demonstrate our approach's competitive performance compared to the state-of-the-art for synthetic and real data.

8.3MLOct 21, 2019

Sequential Adversarial Anomaly Detection for One-Class Event Data

Shixiang Zhu, Henry Shaowu Yuchi, Minghe Zhang et al.

We consider the sequential anomaly detection problem in the one-class setting when only the anomalous sequences are available and propose an adversarial sequential detector by solving a minimax problem to find an optimal detector against the worst-case sequences from a generator. The generator captures the dependence in sequential events using the marked point process model. The detector sequentially evaluates the likelihood of a test sequence and compares it with a time-varying threshold, also learned from data through the minimax problem. We demonstrate our proposed method's good performance using numerical experiments on simulations and proprietary large-scale credit card fraud datasets. The proposed method can generally apply to detecting anomalous sequences.

4.9MLSep 11, 2019

Goodness-of-fit tests on manifolds

Alexander Shapiro, Yao Xie, Rui Zhang

We develop a general theory for the goodness-of-fit test to non-linear models. In particular, we assume that the observations are noisy samples of a submanifold defined by a sufficiently smooth non-linear map. The observation noise is additive Gaussian. Our main result shows that the "residual" of the model fit, by solving a non-linear least-squares problem, follows a (possibly noncentral) $\chi^2$ distribution. The parameters of the $\chi^2$ distribution are related to the model order and dimension of the problem. We further present a method to select the model orders sequentially. We demonstrate the broad application of the general theory in machine learning and signal processing, including determining the rank of low-rank (possibly complex-valued) matrices and tensors from noisy, partial, or indirect observations, determining the number of sources in signal demixing, and potential applications in determining the number of hidden nodes in neural networks.

11.5LGJun 13, 2019Code

Imitation Learning of Neural Spatio-Temporal Point Processes

Shixiang Zhu, Shuang Li, Zhigang Peng et al.

We present a novel Neural Embedding Spatio-Temporal (NEST) point process model for spatio-temporal discrete event data and develop an efficient imitation learning (a type of reinforcement learning) based approach for model fitting. Despite the rapid development of one-dimensional temporal point processes for discrete event data, the study of spatio-temporal aspects of such data is relatively scarce. Our model captures complex spatio-temporal dependence between discrete events by carefully designing a mixture of heterogeneous Gaussian diffusion kernels, whose parameters are parameterized by neural networks. This new kernel is the key to our model's ability to capture intricate spatial dependence patterns while still leading to interpretable results, as we examine maps of Gaussian diffusion kernel parameters. The imitation learning model fitting for NEST is more robust than the maximum likelihood estimate. It directly measures the divergence between the empirical distributions of the training data and the model-generated data. Moreover, our imitation learning-based approach enjoys computational efficiency due to the explicit characterization of the reward function related to the likelihood function; furthermore, the likelihood function under our model enjoys a tractable expression due to the Gaussian kernel parameterization. Experiments based on real data show our method's good performance relative to the state-of-the-art and the good interpretability of NEST's results.

4.1CVMay 11, 2019

Deep Zero-Shot Learning for Scene Sketch

Yao Xie, Peng Xu, Zhanyu Ma

We introduce a novel problem of scene sketch zero-shot learning (SSZSL), which is a challenging task, since (i) unlike for photos, the gap between the common semantic domain (e.g., word vectors) and sketches is too large to exploit common semantic knowledge as the bridge for knowledge transfer, and (ii) compared with single-object sketches, a more expressive feature representation for scene sketches is required to accommodate their high level of abstraction and complexity. To overcome these challenges, we propose a deep embedding model for scene sketch zero-shot learning. In particular, we propose the augmented semantic vector to conduct domain alignment by fusing multi-modal semantic knowledge (e.g., cartoon image, natural image, text description), and adopt an attention-based network for scene sketch feature learning. Moreover, we propose a novel distance metric to improve the similarity measure during testing. Extensive experiments and ablation studies demonstrate the benefit of our sketch-specific design.

10.4MLFeb 1, 2019

Spatial-Temporal-Textual Point Processes for Crime Linkage Detection

Shixiang Zhu, Yao Xie

Crimes emerge out of complex interactions of human behaviors and situations. Linkages between crime incidents are highly complex. Detecting crime linkage given a set of incidents is a highly challenging task since we only have limited information, including text descriptions, incident times, and locations. In practice, there are very few labels. We propose a new statistical modeling framework for spatio-temporal-textual data and demonstrate its usage on crime linkage detection. We capture linkages of crime incidents via multivariate marked spatio-temporal Hawkes processes and treat embedding vectors of the free text as marks of the incident, inspired by the notion of modus operandi (M.O.) in crime analysis. Numerical results using real data demonstrate the good performance of our method and reveal interesting patterns in the crime data: the joint modeling of space, time, and text information enhances crime linkage detection compared with the state-of-the-art, and the learned spatial dependence from data can be useful for police operations.
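As a rough illustration of the self-exciting idea behind Hawkes processes, the sketch below evaluates a conditional intensity for the simplest case: a univariate, purely temporal process with an exponential triggering kernel. This is a deliberate simplification and an assumption made here, not the multivariate marked spatio-temporal model of the paper; the function name `hawkes_intensity` and the parameters `mu`, `alpha`, and `beta` are hypothetical.

```python
import math


def hawkes_intensity(t, history, mu, alpha, beta):
    """Conditional intensity of a univariate Hawkes process at time t.

    Assumed form (illustrative only):
        lambda(t) = mu + sum over past events s < t of
                    alpha * beta * exp(-beta * (t - s))
    mu    : background rate
    alpha : expected number of events triggered by one event
    beta  : decay rate of the triggering kernel
    """
    excitation = sum(
        alpha * beta * math.exp(-beta * (t - s)) for s in history if s < t
    )
    return mu + excitation


# Each past event raises the intensity, and the effect fades over time.
events = [0.5, 1.0, 1.2]
for t in (1.3, 2.0, 5.0):
    print(t, hawkes_intensity(t, events, mu=0.2, alpha=0.8, beta=1.5))
```

In a marked setting, the triggering term would additionally depend on the locations and marks (for example, text embeddings) of the two events; that extension is not shown here.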

17.1CVJan 27, 2019Code

Learning Transformation Synchronization

Xiangru Huang, Zhenxiao Liang, Xiaowei Zhou et al.

Reconstructing the 3D model of a physical object typically requires us to align the depth scans obtained from different camera poses into the same coordinate system. Solutions to this global alignment problem usually proceed in two steps. The first step estimates relative transformations between pairs of scans using an off-the-shelf technique. Due to limited information presented between pairs of scans, the resulting relative transformations are generally noisy. The second step then jointly optimizes the relative transformations among all input depth scans. A natural constraint used in this step is the cycle-consistency constraint, which allows us to prune incorrect relative transformations by detecting inconsistent cycles. The performance of such approaches, however, heavily relies on the quality of the input relative transformations. Instead of merely using the relative transformations as the input to perform transformation synchronization, we propose to use a neural network to learn the weights associated with each relative transformation. Our approach alternates between transformation synchronization using weighted relative transformations and predicting new weights of the input relative transformations using a neural network. We demonstrate the usefulness of this approach across a wide range of datasets.

20.4LGNov 12, 2018

Learning Temporal Point Processes via Reinforcement Learning

Shuang Li, Shuai Xiao, Shixiang Zhu et al.

Social goods, such as healthcare, smart city, and information networks, often produce ordered event data in continuous time. The generative processes of these event data can be very complex, requiring flexible models to capture their dynamics. Temporal point processes offer an elegant framework for modeling event data without discretizing the time. However, the existing maximum-likelihood-estimation (MLE) learning paradigm requires hand-crafting the intensity function beforehand and cannot directly monitor the goodness-of-fit of the estimated model in the process of training. To alleviate the risk of model-misspecification in MLE, we propose to generate samples from the generative model and monitor the quality of the samples in the process of training until the samples and the real data are indistinguishable. We take inspiration from reinforcement learning (RL) and treat the generation of each event as the action taken by a stochastic policy. We parameterize the policy as a flexible recurrent neural network and gradually improve the policy to mimic the observed event distribution. Since the reward function is unknown in this setting, we uncover an analytic and nonparametric form of the reward function using an inverse reinforcement learning formulation. This new RL framework allows us to derive an efficient policy gradient algorithm for learning flexible point process models, and we show that it performs well in both synthetic and real data.

5.5MLJun 15, 2018

Crime Event Embedding with Unsupervised Feature Selection

Shixiang Zhu, Yao Xie

We present a novel event embedding algorithm for crime data that can jointly capture time, location, and the complex free-text component of each event. The embedding is achieved by regularized Restricted Boltzmann Machines (RBMs), and we introduce a new way to regularize by imposing an $\ell_1$ penalty on the conditional distributions of the observed variables of RBMs. This choice of regularization performs feature selection, and it also leads to efficient computation since the gradient can be computed in closed form. The feature selection forces the embedding to be based on the most important keywords, which capture the common modus operandi (M.O.) in crime series. Using numerical experiments on a large-scale crime dataset, we show that our regularized RBMs can achieve better event embedding and that the selected features are highly interpretable to humans.

20.4MLMay 27, 2018

Robust Hypothesis Testing Using Wasserstein Uncertainty Sets

Rui Gao, Liyan Xie, Yao Xie et al.

We develop a novel, computationally efficient, and general framework for robust hypothesis testing. The new framework features a new way to construct uncertainty sets under the null and the alternative distributions, which are sets centered around the empirical distribution defined via the Wasserstein metric; thus, our approach is data-driven and free of distributional assumptions. We develop a convex safe approximation of the minimax formulation and show that such an approximation renders a nearly optimal detector among the family of all possible tests. By exploiting the structure of the least favorable distribution, we also develop a tractable reformulation of such an approximation, whose complexity is independent of the dimension of the observation space and can be nearly sample-size-independent in general. A real-data example using human activity data demonstrates the excellent performance of the new robust detector.

19.7MLFeb 11, 2018

Nearly Optimal Adaptive Procedure with Change Detection for Piecewise-Stationary Bandit

Yang Cao, Zheng Wen, Branislav Kveton et al.

Multi-armed bandit (MAB) is a class of online learning problems where a learning agent aims to maximize its expected cumulative reward while repeatedly selecting arms with unknown reward distributions to pull. We consider a scenario where the reward distributions may change in a piecewise-stationary fashion at unknown time steps. We show that by incorporating a simple change-detection component with classic UCB algorithms to detect and adapt to changes, our so-called M-UCB algorithm can achieve a nearly optimal regret bound on the order of $O(\sqrt{MKT\log T})$, where $T$ is the number of time steps, $K$ is the number of arms, and $M$ is the number of stationary segments. Comparison with the best available lower bound shows that our M-UCB is nearly optimal in $T$ up to a logarithmic factor. We also compare M-UCB with the state-of-the-art algorithms in numerical experiments using a public Yahoo! dataset to demonstrate its superior performance.
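The two ingredients named in the abstract, a UCB index and a simple change-detection check, can be sketched as follows. The abstract does not spell out the detection rule, so the two-half comparison below (compare the reward sums of the older and newer halves of a sliding window against a threshold) is an assumption chosen for illustration; `change_detected`, `ucb_index`, the window contents, and the threshold `b` are all hypothetical names and values.

```python
import math


def change_detected(window, b):
    """Assumed detection rule: flag a change when the reward sums of the
    newer and older halves of an even-length window differ by more than b."""
    half = len(window) // 2
    return abs(sum(window[half:]) - sum(window[:half])) > b


def ucb_index(mean_reward, pulls, rounds_since_restart):
    """Classic UCB1 index: empirical mean plus an exploration bonus.
    Requires pulls >= 1 and rounds_since_restart >= 1."""
    return mean_reward + math.sqrt(2.0 * math.log(rounds_since_restart) / pulls)


# Rewards of one arm: stable in the first window, shifted upward in the second.
stable = [0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2]
shifted = [0.1, 0.2, 0.1, 0.2, 0.9, 0.8, 0.9, 0.8]
print(change_detected(stable, b=2.0))
print(change_detected(shifted, b=2.0))
```

In a full procedure, a detected change would trigger a restart: the per-arm statistics and the round counter passed to `ucb_index` are reset so that stale rewards from the previous segment no longer bias the index.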

6.6MLOct 28, 2017

Crime incidents embedding using restricted Boltzmann machines

Shixiang Zhu, Yao Xie

We present a new approach for detecting related crime series by unsupervised learning of the latent feature embeddings from narratives of crime records via Gaussian-Bernoulli Restricted Boltzmann Machines (RBMs). This is a drastically different approach from prior work on crime analysis, which typically considers only time and location and at most category information. After the embedding, related cases are closer to each other in the Euclidean feature space, and unrelated cases are far apart, a property that enables subsequent analysis such as detection and clustering of related cases. Experiments over several series of related crime incidents hand-labeled by the Atlanta Police Department reveal the promise of our embedding methods.

1.2STJun 15, 2017

Sequential detection of low-rank changes using extreme eigenvalues

Liyan Xie, Yao Xie

We study the problem of detecting an abrupt change to the signal covariance matrix. In particular, the covariance changes from a "white" identity matrix to an unknown spiked or low-rank matrix. Two sequential change-point detection procedures are presented, based on the largest and the smallest eigenvalues of the sample covariance matrix. To control the false-alarm rate, we present an accurate theoretical approximation to the average-run-length (ARL) and expected detection delay (EDD) of the detection procedures, leveraging the extreme eigenvalue distributions from random matrix theory and capturing a non-negligible temporal correlation in the sequence of scan statistics due to the sliding window approach. Real data examples demonstrate the good performance of our method for detecting behavior change of a swarm.
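A minimal sketch of the largest-eigenvalue procedure over a sliding window is given below, in pure Python. It assumes zero-mean observations, uses power iteration as a stand-in eigenvalue routine, and takes the threshold as given rather than deriving it from the ARL approximation; the function names, window size, and threshold are hypothetical choices for illustration.

```python
def sample_covariance(window):
    """Sample covariance (1/n) * sum of x x^T, assuming zero-mean data."""
    n, p = len(window), len(window[0])
    return [[sum(x[i] * x[j] for x in window) / n for j in range(p)]
            for i in range(p)]


def largest_eigenvalue(matrix, iterations=200):
    """Power iteration for the top eigenvalue of a symmetric PSD matrix.
    Assumes the starting vector is not orthogonal to the top eigenvector."""
    p = len(matrix)
    v = [1.0 / p ** 0.5] * p
    value = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(c * c for c in w) ** 0.5
        if norm == 0.0:
            return 0.0
        v = [c / norm for c in w]
        value = norm
    return value


def first_alarm(stream, window_size, threshold):
    """Return the first time t (number of samples seen) at which the largest
    eigenvalue of the windowed sample covariance exceeds the threshold."""
    for t in range(window_size, len(stream) + 1):
        cov = sample_covariance(stream[t - window_size:t])
        if largest_eigenvalue(cov) > threshold:
            return t
    return None


# Toy stream: small alternating observations, then a strong common component.
stream = [[1.0, 0.0], [0.0, 1.0]] * 3 + [[3.0, 3.0]] * 3
print(first_alarm(stream, window_size=4, threshold=2.0))
```

The smallest-eigenvalue variant mentioned in the abstract would replace `largest_eigenvalue` with a routine for the bottom eigenvalue and is not shown here.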

1.2STMay 19, 2017

Nearly second-order asymptotic optimality of sequential change-point detection with one-sample updates

Yang Cao, Liyan Xie, Yao Xie et al.

Sequential change-point detection when the distribution parameters are unknown is a fundamental problem in statistics and machine learning. When the post-change parameters are unknown, we consider a set of detection procedures based on sequential likelihood ratios with non-anticipating estimators constructed using online convex optimization algorithms such as online mirror descent, which provides a more versatile approach to tackle complex situations where recursive maximum likelihood estimators cannot be found. When the underlying distributions belong to an exponential family and the estimators satisfy the logarithmic regret property, we show that this approach is nearly second-order asymptotically optimal. This means that the upper bound for the false alarm rate of the algorithm (measured by the average-run-length) meets the lower bound asymptotically up to a log-log factor when the threshold tends to infinity. Our proof is achieved by making a connection between sequential change-point detection and online convex optimization and leveraging the logarithmic regret bound property of the online mirror descent algorithm. Numerical and real data examples validate our theory.

1.2STDec 5, 2016

Dynamic change-point detection using similarity networks

Shanshan Cao, Yao Xie

From a sequence of similarity networks, with edges representing certain similarity measures between nodes, we are interested in detecting a change-point which changes the statistical property of the networks. After the change, a subset of anomalous nodes compares dissimilarly with the normal nodes. We study a simple sequential change detection procedure based on node-wise average similarity measures, and study its theoretical properties. Simulation and real-data examples demonstrate that such a simple stopping procedure has reasonably good performance. We further discuss faulty sensor isolation (estimating anomalous nodes) using community detection.

1.0LGOct 14, 2016

Data-Driven Threshold Machine: Scan Statistics, Change-Point Detection, and Extreme Bandits

Shuang Li, Yao Xie, Le Song

We present a novel distribution-free approach, the data-driven threshold machine (DTM), for a fundamental problem at the core of many learning tasks: choosing a threshold for a given pre-specified level that bounds the tail probability of the maximum of a (possibly dependent but stationary) random sequence. We do not assume a data distribution, but rather rely on the asymptotic distribution of extremal values, and reduce the problem to estimating three parameters of the extreme value distributions and the extremal index. We take special care of data dependence by estimating the extremal index, since in many settings, such as scan statistics, change-point detection, and extreme bandits, dependence in the sequence of statistics can be significant. Key features of our DTM also include robustness and computational efficiency, and it requires only one sample path to form a reliable estimate of the threshold, in contrast to the Monte Carlo sampling approach, which requires drawing a large number of sample paths. We demonstrate the good performance of DTM via numerical examples in various dependent settings.

3.6MLOct 3, 2016

Sequential Low-Rank Change Detection

Yao Xie, Lee Seversky

Detecting the emergence of a low-rank signal from high-dimensional data is an important problem arising in many applications such as camera surveillance and swarm monitoring using sensors. We consider a procedure based on the largest eigenvalue of the sample covariance matrix over a sliding window to detect the change. To achieve dimensionality reduction, we present a sketching-based approach for rank change detection using low-dimensional linear sketches of the original high-dimensional observations. The premise is that when the sketching matrix is a random Gaussian matrix, and the dimension of the sketching vector is sufficiently large, the rank of the sample covariance matrix for these sketches equals the rank of the original sample covariance matrix with high probability. Hence, we may be able to detect the low-rank change using sample covariance matrices of the sketches without having to recover the original covariance matrix. We characterize the performance of the largest eigenvalue statistic in terms of the false-alarm rate and the expected detection delay, and present an efficient online implementation via subspace tracking.

4.3LGMar 29, 2016

Detecting weak changes in dynamic events over networks

Shuang Li, Yao Xie, Mehrdad Farajtabar et al.

Large volumes of networked streaming event data are becoming increasingly available in a wide variety of applications, such as social network analysis, Internet traffic monitoring, and healthcare analytics. Streaming event data are discrete observations occurring in continuous time, and the precise time interval between two events carries a great deal of information about the dynamics of the underlying systems. How can we promptly detect changes in these dynamic systems using these streaming event data? In this paper, we propose a novel change-point detection framework for multi-dimensional event data over networks. We cast the problem as a sequential hypothesis test, and derive the likelihood ratios for point processes, which are computed efficiently via an EM-like algorithm that is parameter-free and can be computed in a distributed fashion. We derive a highly accurate theoretical characterization of the false-alarm rate, and show that the procedure can achieve weak signal detection by aggregating local statistics over time and networks. Finally, we demonstrate the good performance of our algorithm on numerical examples and real-world datasets from Twitter and Memetracker.

1.1LGSep 1, 2015

Online Supervised Subspace Tracking

Yao Xie, Ruiyang Song, Hanjun Dai et al.

We present a framework for supervised subspace tracking, where there are two time series $x_t$ and $y_t$, one being the high-dimensional predictors and the other being the response variables, and the subspace tracking needs to take both sequences into consideration. It extends the classic online subspace tracking work, which can be viewed as tracking of $x_t$ only. Our online sufficient dimensionality reduction (OSDR) is a meta-algorithm that can be applied to various cases including linear regression, logistic regression, multiple linear regression, multinomial logistic regression, support vector machine, the random dot product model, and the multi-scale union-of-subspace model. OSDR reduces data dimensionality on the fly with low computational complexity, and it can also handle missing data and dynamic data. OSDR uses an alternating minimization scheme and updates the subspace via gradient descent on the Grassmannian manifold. The subspace update can be performed efficiently utilizing the fact that the Grassmannian gradient with respect to the subspace in many settings is rank-one (or low-rank in certain cases). The optimization problem for OSDR is non-convex and hard to analyze in general; we provide convergence analysis of OSDR in a simple linear regression setting. The good performance of OSDR compared with conventional unsupervised subspace tracking is demonstrated via numerical examples on simulated and real data.

1.2ITSep 1, 2015

Sequential Information Guided Sensing

Ruiyang Song, Yao Xie, Sebastian Pokutta

We study the value of information in sequential compressed sensing by characterizing the performance of sequential information guided sensing in practical scenarios when information is inaccurate. In particular, we assume the signal distribution is parameterized through Gaussian or Gaussian mixtures with estimated mean and covariance matrices, and we can measure compressively through a noisy linear projection or using one-sparse vectors, i.e., observing one entry of the signal each time. We establish a set of performance bounds for the bias and variance of the signal estimator via posterior mean, by capturing the conditional entropy (which is also related to the size of the uncertainty), and the additional power required due to inaccurate information to reach a desired precision. Based on this, we further study how to estimate covariance based on direct samples or covariance sketching. Numerical examples also demonstrate the superior performance of Info-Greedy Sensing algorithms compared with their random and non-adaptive counterparts.

2.8MLSep 1, 2015

Multi-Sensor Slope Change Detection

Yang Cao, Yao Xie, Nagi Gebraeel

We develop a mixture procedure for multi-sensor systems to monitor data streams for a change-point that causes a gradual degradation to a subset of the streams. Observations are assumed to be initially normal random variables with known constant means and variances. After the change-point, observations in the subset will have increasing or decreasing means. The subset and the rates of change are unknown. Our procedure uses a mixture statistic, which assumes that each sensor is affected by the change-point with probability $p_0$. Analytic expressions are obtained for the average run length (ARL) and the expected detection delay (EDD) of the mixture procedure, which are demonstrated to be quite accurate numerically. We establish the asymptotic optimality of the mixture procedure. Numerical examples demonstrate the good performance of the proposed procedure. We also discuss an adaptive mixture procedure using empirical Bayes. This paper extends our earlier work on detecting an abrupt change-point that causes a mean-shift, by tackling the challenges posed by the non-stationarity of the slope-change problem.

14.9LGJul 5, 2015

Scan $B$-Statistic for Kernel Change-Point Detection

Shuang Li, Yao Xie, Hanjun Dai et al.

Detecting the emergence of an abrupt change-point is a classic problem in statistics and machine learning. Kernel-based nonparametric statistics, which require fewer assumptions on the distributions than the parametric approach and can handle high-dimensional data, have been used for this task. In this paper, we focus on the scenario where the amount of background data is large, and propose two related computationally efficient kernel-based statistics for change-point detection, which are inspired by the recently developed $B$-statistics. A novel theoretical result of the paper is the characterization of the tail probability of these statistics using the change-of-measure technique, which focuses on characterizing the tail of the detection statistics rather than obtaining its asymptotic distribution under the null distribution. Such approximations are crucial to control the false alarm rate, which corresponds to the significance level in offline change-point detection and the average-run-length in online change-point detection. Our approximations are shown to be highly accurate. Thus, they provide a convenient way to find detection thresholds for both offline and online cases without the need to resort to the more expensive simulations or bootstrapping. We show that our methods perform well on both synthetic data and real data.
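The block-averaging idea behind $B$-statistics can be illustrated with a short pure-Python sketch: compute an unbiased estimate of the squared maximum mean discrepancy (MMD) between each reference block and the test block, then average over the reference blocks. The sketch assumes scalar data, equal block sizes, and a Gaussian kernel with a fixed bandwidth, and it omits the variance normalization and threshold calibration that a complete detection procedure would need; the function names are hypothetical.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd_unbiased(xs, ys, bandwidth=1.0):
    """Unbiased estimate of squared MMD between two equal-sized samples."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += (gaussian_kernel(xs[i], xs[j], bandwidth)
                          + gaussian_kernel(ys[i], ys[j], bandwidth)
                          - gaussian_kernel(xs[i], ys[j], bandwidth)
                          - gaussian_kernel(xs[j], ys[i], bandwidth))
    return total / (n * (n - 1))


def block_statistic(reference_blocks, test_block, bandwidth=1.0):
    """Average the unbiased MMD estimates over all reference blocks."""
    values = [mmd_unbiased(block, test_block, bandwidth)
              for block in reference_blocks]
    return sum(values) / len(values)


reference = [[0.0, 0.1, -0.1, 0.05], [0.02, -0.05, 0.1, -0.08]]
unchanged = [0.03, -0.02, 0.08, -0.1]
shifted = [3.0, 3.1, 2.9, 3.05]
print(block_statistic(reference, unchanged))  # close to zero
print(block_statistic(reference, shifted))    # clearly positive
```

Each block comparison costs time quadratic in the block size rather than in the full amount of background data, which is where the computational saving of the block construction comes from.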

4.3NAJul 2, 2015

Categorical Matrix Completion

Yang Cao, Yao Xie

We consider the problem of completing a matrix with categorical-valued entries from partial observations. This is achieved by extending the formulation and theory of one-bit matrix completion. We recover a low-rank matrix $X$ by maximizing the likelihood ratio with a constraint on the nuclear norm of $X$, and the observations are mapped from entries of $X$ through multiple link functions. We establish theoretical upper and lower bounds on the recovery error, which meet up to a constant factor $\mathcal{O}(K^{3/2})$ where $K$ is the fixed number of categories. The upper bound in our case depends on the number of categories implicitly through a maximization of terms that involve the smoothness of the link functions. In contrast to one-bit matrix completion, our bounds for categorical matrix completion are optimal up to a factor on the order of the square root of the number of categories, which is consistent with an intuition that the problem becomes harder when the number of categories increases. By comparing the performance of our method with the conventional matrix completion method on the MovieLens dataset, we demonstrate the advantage of our method.

7.4LGMay 25, 2015

Sketching for Sequential Change-Point Detection

Yang Cao, Andrew Thompson, Meng Wang et al.

We study sequential change-point detection procedures based on linear sketches of high-dimensional signal vectors using generalized likelihood ratio (GLR) statistics. The GLR statistics allow for an unknown post-change mean that represents an anomaly or novelty. We consider both fixed and time-varying projections, and derive theoretical approximations to two fundamental performance metrics: the average run length (ARL) and the expected detection delay (EDD); these approximations are shown to be highly accurate by numerical simulations. We further characterize the relative performance measure of the sketching procedure compared to that without sketching and show that there can be little performance loss when the signal strength is sufficiently large and a sufficient number of sketches is used. Finally, we demonstrate the good performance of sketching procedures using simulation and real-data examples on solar flare detection and failure detection in power networks.