Cun-Hui Zhang

ST
h-index11
20papers
1,024citations
Novelty57%
AI Score34

20 Papers

STJul 14, 2023
Adaptive Linear Estimating Equations

Mufang Ying, Koulik Khamaru, Cun-Hui Zhang

Sequential data collection has emerged as a widely adopted technique for enhancing the efficiency of data gathering processes. Despite its advantages, such data collection mechanism often introduces complexities to the statistical inference procedure. For instance, the ordinary least squares (OLS) estimator in an adaptive linear regression model can exhibit non-normal asymptotic behavior, posing challenges for accurate inference and interpretation. In this paper, we propose a general method for constructing debiased estimator which remedies this issue. It makes use of the idea of adaptive linear estimating equations, and we establish theoretical guarantees of asymptotic normality, supplemented by discussions on achieving near-optimal asymptotic variance. A salient feature of our estimator is that in the context of multi-armed bandits, our estimator retains the non-asymptotic performance of the least square estimator while obtaining asymptotic normality property. Consequently, this work helps connect two fruitful paradigms of adaptive inference: a) non-asymptotic inference using concentration inequalities and b) asymptotic inference via asymptotic normality.

MLAug 8, 2024
Inference with the Upper Confidence Bound Algorithm

Koulik Khamaru, Cun-Hui Zhang

In this paper, we discuss the asymptotic behavior of the Upper Confidence Bound (UCB) algorithm in the context of multiarmed bandit problems and discuss its implication in downstream inferential tasks. While inferential tasks become challenging when data is collected in a sequential manner, we argue that this problem can be alleviated when the sequential algorithm at hand satisfies certain stability property. This notion of stability is motivated from the seminal work of Lai and Wei (1982). Our first main result shows that such a stability property is always satisfied for the UCB algorithm, and as a result the sample means for each arm are asymptotically normal. Next, we examine the stability properties of the UCB algorithm when the number of arms $K$ is allowed to grow with the number of arm pulls $T$. We show that in such a case the arms are stable when $\frac{\log K}{\log T} \rightarrow 0$, and the number of near-optimal arms are large.

STOct 1, 2023
Statistical Limits of Adaptive Linear Models: Low-Dimensional Estimation and Inference

Licong Lin, Mufang Ying, Suvrojit Ghosh et al.

Estimation and inference in statistics pose significant challenges when data are collected adaptively. Even in linear models, the Ordinary Least Squares (OLS) estimator may fail to exhibit asymptotic normality for single coordinate estimation and have inflated error. This issue is highlighted by a recent minimax lower bound, which shows that the error of estimating a single coordinate can be enlarged by a multiple of $\sqrt{d}$ when data are allowed to be arbitrarily adaptive, compared with the case when they are i.i.d. Our work explores this striking difference in estimation performance between utilizing i.i.d. and adaptive data. We investigate how the degree of adaptivity in data collection impacts the performance of estimating a low-dimensional parameter component in high-dimensional linear models. We identify conditions on the data collection mechanism under which the estimation error for a low-dimensional parameter component matches its counterpart in the i.i.d. setting, up to a factor that depends on the degree of adaptivity. We show that OLS or OLS on centered data can achieve this matching error. In addition, we propose a novel estimator for single coordinate inference via solving a Two-stage Adaptive Linear Estimating equation (TALE). Under a weaker form of adaptivity in data collection, we establish an asymptotic normality property of the proposed estimator.

STDec 9, 2024
UCB algorithms for multi-armed bandits: Precise regret and adaptive inference

Qiyang Han, Koulik Khamaru, Cun-Hui Zhang

Upper Confidence Bound (UCB) algorithms are a widely-used class of sequential algorithms for the $K$-armed bandit problem. Despite extensive research over the past decades aimed at understanding their asymptotic and (near) minimax optimality properties, a precise understanding of their regret behavior remains elusive. This gap has not only hindered the evaluation of their actual algorithmic efficiency, but also limited further developments in statistical inference in sequential data collection. This paper bridges these two fundamental aspects--precise regret analysis and adaptive statistical inference--through a deterministic characterization of the number of arm pulls for an UCB index algorithm [Lai87, Agr95, ACBF02]. Our resulting precise regret formula not only accurately captures the actual behavior of the UCB algorithm for finite time horizons and individual problem instances, but also provides significant new insights into the regimes in which the existing theory remains informative. In particular, we show that the classical Lai-Robbins regret formula is exact if and only if the sub-optimality gaps exceed the order $σ\sqrt{K\log T/T}$. We also show that its maximal regret deviates from the minimax regret by a logarithmic factor, and therefore settling its strict minimax optimality in the negative. The deterministic characterization of the number of arm pulls for the UCB algorithm also has major implications in adaptive statistical inference. Building on the seminal work of [Lai82], we show that the UCB algorithm satisfies certain stability properties that lead to quantitative central limit theorems in two settings including the empirical means of unknown rewards in the bandit setting. These results have an important practical implication: conventional confidence sets designed for i.i.d. data remain valid even when data are collected sequentially.

MLAug 10, 2021
Tensor Principal Component Analysis in High Dimensional CP Models

Yuefeng Han, Cun-Hui Zhang

The CP decomposition for high dimensional non-orthogonal spiked tensors is an important problem with broad applications across many disciplines. However, previous works with theoretical guarantee typically assume restrictive incoherence conditions on the basis vectors for the CP components. In this paper, we propose new computationally efficient composite PCA and concurrent orthogonalization algorithms for tensor CP decomposition with theoretical guarantees under mild incoherence conditions. The composite PCA applies the principal component or singular value decompositions twice, first to a matrix unfolding of the tensor data to obtain singular vectors and then to the matrix folding of the singular vectors obtained in the first step. It can be used as an initialization for any iterative optimization schemes for the tensor CP decomposition. The concurrent orthogonalization algorithm iteratively estimates the basis vector in each mode of the tensor by simultaneously applying projections to the orthogonal complements of the spaces generated by other CP components in other modes. It is designed to improve the alternating least squares estimator and other forms of the high order orthogonal iteration for tensors with low or moderately high CP ranks, and it is guaranteed to converge rapidly when the error of any given initial estimator is bounded by a small constant. Our theoretical investigation provides estimation accuracy and convergence rates for the two proposed algorithms. Both proposed algorithms are applicable to deterministic tensor, its noisy version, and the order-$2K$ covariance tensor of order-$K$ tensor data in a factor model with uncorrelated factors. Our implementations on synthetic data demonstrate significant practical superiority of our approach over existing methods.

STJul 8, 2021
Asymptotic normality of robust $M$-estimators with convex penalty

Pierre C Bellec, Yiwei Shen, Cun-Hui Zhang

This paper develops asymptotic normality results for individual coordinates of robust M-estimators with convex penalty in high-dimensions, where the dimension $p$ is at most of the same order as the sample size $n$, i.e, $p/n\leγ$ for some fixed constant $γ>0$. The asymptotic normality requires a bias correction and holds for most coordinates of the M-estimator for a large class of loss functions including the Huber loss and its smoothed versions regularized with a strongly convex penalty. The asymptotic variance that characterizes the width of the resulting confidence intervals is estimated with data-driven quantities. This estimate of the variance adapts automatically to low ($p/n\to0)$ or high ($p/n \le γ$) dimensions and does not involve the proximal operators seen in previous works on asymptotic normality of M-estimators. For the Huber loss, the estimated variance has a simple expression involving an effective degrees-of-freedom as well as an effective sample size. The case of the Huber loss with Elastic-Net penalty is studied in details and a simulation study confirms the theoretical findings. The asymptotic normality results follow from Stein formulae for high-dimensional random vectors on the sphere developed in the paper which are of independent interest.

STFeb 24, 2019
De-Biasing The Lasso With Degrees-of-Freedom Adjustment

Pierre C. Bellec, Cun-Hui Zhang

This paper studies schemes to de-bias the Lasso in a linear model $y=Xβ+ε$ where the goal is to construct confidence intervals for $a_0^Tβ$ in a direction $a_0$, where $X$ has iid $N(0,Σ)$ rows. We show that previously analyzed propositions to de-bias the Lasso require a modification in order to enjoy efficiency in a full range of sparsity. This modification takes the form of a degrees-of-freedom adjustment that accounts for the dimension of the model selected by Lasso. Let $s_0$ be the true sparsity. If $Σ$ is known and the ideal score vector proportional to $XΣ^{-1}a_0$ is used, the unadjusted de-biasing schemes proposed previously enjoy efficiency if $s_0\lll n^{2/3}$. However, if $s_0\ggg n^{2/3}$, the unadjusted schemes cannot be efficient in certain $a_0$: then it is necessary to modify existing procedures by a degrees-of-freedom adjustment. This modification grants asymptotic efficiency for any $a_0$ when $s_0/p\to 0$ and $s_0\log(p/s_0)/n \to 0$. If $Σ$ is unknown, efficiency is granted for general $a_0$ when $$\frac{s_0\log p}{n}+\min\Big\{\frac{s_Ω\log p}{n},\frac{\|Σ^{-1}a_0\|_1\sqrt{\log p}}{\|Σ^{-1/2}a_0\|_2 \sqrt n}\Big\}+\frac{\min(s_Ω,s_0)\log p}{\sqrt n}\to0$$ where $s_Ω=\|Σ^{-1}a_0\|_0$, provided that the de-biased estimate is modified with the degrees-of-freedom adjustment. The dependence in $s_0,s_Ω$ and $\|Σ^{-1}a_0\|_1$ is optimal. Our estimated score vector provides a novel methodology to handle dense $a_0$. Our analysis shows that the degrees-of-freedom adjustment is not needed when the initial bias in direction $a_0$ is small, which is granted under stringent conditions on $Σ^{-1}$. The main proof argument is an interpolation path similar to that typically used to derive Slepian's lemma. It yields a new $\ell_\infty$ error bound for the Lasso which is of independent interest.

MLNov 14, 2017
Statistically Optimal and Computationally Efficient Low Rank Tensor Completion from Noisy Entries

Dong Xia, Ming Yuan, Cun-Hui Zhang

In this article, we develop methods for estimating a low rank tensor from noisy observations on a subset of its entries to achieve both statistical and computational efficiencies. There have been a lot of recent interests in this problem of noisy tensor completion. Much of the attention has been focused on the fundamental computational challenges often associated with problems involving higher order tensors, yet very little is known about their statistical performance. To fill in this void, in this article, we characterize the fundamental statistical limits of noisy tensor completion by establishing minimax optimal rates of convergence for estimating a $k$th order low rank tensor under the general $\ell_p$ ($1\le p\le 2$) norm which suggest significant room for improvement over the existing approaches. Furthermore, we propose a polynomial-time computable estimating procedure based upon power iteration and a second-order spectral initialization that achieves the optimal rates of convergence. Our method is fairly easy to implement and numerical experiments are presented to further demonstrate the practical merits of our estimator.

MEAug 1, 2016
Theory of the GMM Kernel

Ping Li, Cun-Hui Zhang

We develop some theoretical results for a robust similarity measure named "generalized min-max" (GMM). This similarity has direct applications in machine learning as a positive definite kernel and can be efficiently computed via probabilistic hashing. Owing to the discrete nature, the hashed values can also be used for efficient near neighbor search. We prove the theoretical limit of GMM and the consistency result, assuming that the data follow an elliptical distribution, which is a very general family of distributions and includes the multivariate $t$-distribution as a special case. The consistency result holds as long as the data have bounded first moment (an assumption which essentially holds for datasets commonly encountered in practice). Furthermore, we establish the asymptotic normality of GMM. Compared to the "cosine" similarity which is routinely adopted in current practice in statistics and machine learning, the consistency of GMM requires much weaker conditions. Interestingly, when the data follow the $t$-distribution with $ν$ degrees of freedom, GMM typically provides a better measure of similarity than "cosine" roughly when $ν<8$ (which is already very close to normal). These theoretical results will help explain the recent success of GMM in learning tasks.

STJun 10, 2016
Incoherent Tensor Norms and Their Applications in Higher Order Tensor Completion

Ming Yuan, Cun-Hui Zhang

In this paper, we investigate the sample size requirement for a general class of nuclear norm minimization methods for higher order tensor completion. We introduce a class of tensor norms by allowing for different levels of coherence, which allows us to leverage the incoherence of a tensor. In particular, we show that a $k$th order tensor of rank $r$ and dimension $d\times\cdots\times d$ can be recovered perfectly from as few as $O((r^{(k-1)/2}d^{3/2}+r^{k-1}d)(\log(d))^2)$ uniformly sampled entries through an appropriate incoherent nuclear norm minimization. Our results demonstrate some key differences between completing a matrix and a higher order tensor: They not only point to potential room for improvement over the usual nuclear norm minimization but also highlight the importance of explicitly accounting for incoherence, when dealing with higher order tensors.

MEAug 11, 2014
Compressed Sensing with Very Sparse Gaussian Random Projections

Ping Li, Cun-Hui Zhang

We study the use of very sparse random projections for compressed sensing (sparse signal recovery) when the signal entries can be either positive or negative. In our setting, the entries of a Gaussian design matrix are randomly sparsified so that only a very small fraction of the entries are nonzero. Our proposed decoding algorithm is simple and efficient in that the major cost is one linear scan of the coordinates. We have developed two estimators: (i) the {\em tie estimator}, and (ii) the {\em absolute minimum estimator}. Using only the tie estimator, we are able to recover a $K$-sparse signal of length $N$ using $1.551 eK \log K/δ$ measurements (where $δ\leq 0.05$ is the confidence). Using only the absolute minimum estimator, we can detect the support of the signal using $eK\log N/δ$ measurements. For a particular coordinate, the absolute minimum estimator requires fewer measurements (i.e., with a constant $e$ instead of $1.551e$). Thus, the two estimators can be combined to form an even more practical decoding framework. Prior studies have shown that existing one-scan (or roughly one-scan) recovery algorithms using sparse matrices would require substantially more (e.g., one order of magnitude) measurements than L1 decoding by linear programming, when the nonzero entries of signals can be either negative or positive. In this paper, following a known experimental setup, we show that, at the same number of measurements, the recovery accuracies of our proposed method are (at least) similar to the standard L1 decoding.

MLMay 7, 2014
On Tensor Completion via Nuclear Norm Minimization

Ming Yuan, Cun-Hui Zhang

Many problems can be formulated as recovering a low-rank tensor. Although an increasingly common task, tensor recovery remains a challenging problem because of the delicacy associated with the decomposition of higher order tensors. To overcome these difficulties, existing approaches often proceed by unfolding tensors into matrices and then apply techniques for matrix completion. We show here that such matricization fails to exploit the tensor structure and may lead to suboptimal procedure. More specifically, we investigate a convex optimization approach to tensor completion by directly minimizing a tensor nuclear norm and prove that this leads to an improved sample size requirement. To establish our results, we develop a series of algebraic and probabilistic techniques such as characterization of subdifferetial for tensor nuclear norm and concentration inequalities for tensor martingales, which may be of independent interests and could be useful in other tensor related problems.

MEDec 31, 2013
Sparse Recovery with Very Sparse Compressed Counting

Ping Li, Cun-Hui Zhang, Tong Zhang

Compressed sensing (sparse signal recovery) often encounters nonnegative data (e.g., images). Recently we developed the methodology of using (dense) Compressed Counting for recovering nonnegative K-sparse signals. In this paper, we adopt very sparse Compressed Counting for nonnegative signal recovery. Our design matrix is sampled from a maximally-skewed p-stable distribution (0<p<1), and we sparsify the design matrix so that on average (1-g)-fraction of the entries become zero. The idea is related to very sparse stable random projections (Li et al 2006 and Li 2007), the prior work for estimating summary statistics of the data. In our theoretical analysis, we show that, when p->0, it suffices to use M= K/(1-exp(-gK) log N measurements, so that all coordinates can be recovered in one scan of the coordinates. If g = 1 (i.e., dense design), then M = K log N. If g= 1/K or 2/K (i.e., very sparse design), then M = 1.58K log N or M = 1.16K log N. This means the design matrix can be indeed very sparse at only a minor inflation of the sample complexity. Interestingly, as p->1, the required number of measurements is essentially M = 2.7K log N, provided g= 1/K. It turns out that this result is a general worst-case bound.

MEOct 3, 2013
Compressed Counting Meets Compressed Sensing

Ping Li, Cun-Hui Zhang, Tong Zhang

Compressed sensing (sparse signal recovery) has been a popular and important research topic in recent years. By observing that natural signals are often nonnegative, we propose a new framework for nonnegative signal recovery using Compressed Counting (CC). CC is a technique built on maximally-skewed p-stable random projections originally developed for data stream computations. Our recovery procedure is computationally very efficient in that it requires only one linear scan of the coordinates. Our analysis demonstrates that, when 0<p<=0.5, it suffices to use M= O(C/eps^p log N) measurements so that all coordinates will be recovered within eps additive precision, in one scan of the coordinates. The constant C=1 when p->0 and C=pi/2 when p=0.5. In particular, when p->0 the required number of measurements is essentially M=K\log N, where K is the number of nonzero coordinates of the signal.

STSep 24, 2013
Asymptotic normality and optimalities in estimation of large Gaussian graphical models

Zhao Ren, Tingni Sun, Cun-Hui Zhang et al.

The Gaussian graphical model, a popular paradigm for studying relationship among variables in a wide range of applications, has attracted great attention in recent years. This paper considers a fundamental question: When is it possible to estimate low-dimensional parameters at parametric square-root rate in a large Gaussian graphical model? A novel regression approach is proposed to obtain asymptotically efficient estimation of each entry of a precision matrix under a sparseness condition relative to the sample size. When the precision matrix is not sufficiently sparse, or equivalently the sample size is not sufficiently large, a lower bound is established to show that it is no longer possible to achieve the parametric rate in the estimation of each entry. This lower bound result, which provides an answer to the delicate sample size question, is established with a novel construction of a subset of sparse precision matrices in an application of Le Cam's lemma. Moreover, the proposed estimator is proven to have optimal convergence rate when the parametric rate cannot be achieved, under a minimal sample requirement. The proposed estimator is applied to test the presence of an edge in the Gaussian graphical model or to recover the support of the entire model, to obtain adaptive rate-optimal estimation of the entire precision matrix as measured by the matrix $\ell_q$ operator norm and to make inference in latent variables in the graphical model. All of this is achieved under a sparsity condition on the precision matrix and a side condition on the range of its spectrum. This significantly relaxes the commonly imposed uniform signal strength condition on the precision matrix, irrepresentability condition on the Hessian tensor operator of the covariance matrix or the $\ell_1$ constraint on the precision matrix. Numerical results confirm our theoretical findings. The ROC curve of the proposed algorithm, Asymptotic Normal Thresholding (ANT), for support recovery significantly outperforms that of the popular GLasso algorithm.

MLFeb 4, 2013
Exact Sparse Recovery with L0 Projections

Ping Li, Cun-Hui Zhang

Many applications concern sparse signals, for example, detecting anomalies from the differences between consecutive images taken by surveillance cameras. This paper focuses on the problem of recovering a K-sparse signal x in N dimensions. In the mainstream framework of compressed sensing (CS), the vector x is recovered from M non-adaptive linear measurements y = xS, where S (of size N x M) is typically a Gaussian (or Gaussian-like) design matrix, through some optimization procedure such as linear programming (LP). In our proposed method, the design matrix S is generated from an $α$-stable distribution with $α\approx 0$. Our decoding algorithm mainly requires one linear scan of the coordinates, followed by a few iterations on a small number of coordinates which are "undetermined" in the previous iteration. Comparisons with two strong baselines, linear programming (LP) and orthogonal matching pursuit (OMP), demonstrate that our algorithm can be significantly faster in decoding speed and more accurate in recovery quality, for the task of exact spare recovery. Our procedure is robust against measurement noise. Even when there are no sufficient measurements, our algorithm can still reliably recover a significant portion of the nonzero coordinates. To provide the intuition for understanding our method, we also analyze the procedure by assuming an idealistic setting. Interestingly, when K=2, the "idealized" algorithm achieves exact recovery with merely 3 measurements, regardless of N. For general K, the required sample size of the "idealized" algorithm is about 5K.

LGAug 6, 2012
One Permutation Hashing for Efficient Search and Learning

Ping Li, Art Owen, Cun-Hui Zhang

Recently, the method of b-bit minwise hashing has been applied to large-scale linear learning and sublinear time near-neighbor search. The major drawback of minwise hashing is the expensive preprocessing cost, as the method requires applying (e.g.,) k=200 to 500 permutations on the data. The testing time can also be expensive if a new data point (e.g., a new document or image) has not been processed, which might be a significant issue in user-facing applications. We develop a very simple solution based on one permutation hashing. Conceptually, given a massive binary data matrix, we permute the columns only once and divide the permuted columns evenly into k bins; and we simply store, for each data vector, the smallest nonzero location in each bin. The interesting probability analysis (which is validated by experiments) reveals that our one permutation scheme should perform very similarly to the original (k-permutation) minwise hashing. In fact, the one permutation scheme can be even slightly more accurate, due to the "sample-without-replacement" effect. Our experiments with training linear SVM and logistic regression on the webspam dataset demonstrate that this one permutation hashing scheme can achieve the same (or even slightly better) accuracies compared to the original k-permutation scheme. To test the robustness of our method, we also experiment with the small news20 dataset which is very sparse and has merely on average 500 nonzeros in each data vector. Interestingly, our one permutation scheme noticeably outperforms the k-permutation scheme when k is not too small on the news20 dataset. In summary, our method can achieve at least the same accuracy as the original k-permutation scheme, at merely 1/k of the original preprocessing cost.

STApr 29, 2012
Optimality of Graphlet Screening in High Dimensional Variable Selection

Jiashun Jin, Cun-Hui Zhang, Qi Zhang

Consider a linear regression model where the design matrix X has n rows and p columns. We assume (a) p is much large than n, (b) the coefficient vector beta is sparse in the sense that only a small fraction of its coordinates is nonzero, and (c) the Gram matrix G = X'X is sparse in the sense that each row has relatively few large coordinates (diagonals of G are normalized to 1). The sparsity in G naturally induces the sparsity of the so-called graph of strong dependence (GOSD). We find an interesting interplay between the signal sparsity and the graph sparsity, which ensures that in a broad context, the set of true signals decompose into many different small-size components of GOSD, where different components are disconnected. We propose Graphlet Screening (GS) as a new approach to variable selection, which is a two-stage Screen and Clean method. The key methodological innovation of GS is to use GOSD to guide both the screening and cleaning. Compared to m-variate brute-forth screening that has a computational cost of p^m, the GS only has a computational cost of p (up to some multi-log(p) factors) in screening. We measure the performance of any variable selection procedure by the minimax Hamming distance. We show that in a very broad class of situations, GS achieves the optimal rate of convergence in terms of the Hamming distance. Somewhat surprisingly, the well-known procedures subset selection and the lasso are rate non-optimal, even in very simple settings and even when their tuning parameters are ideally set.

STFeb 13, 2012
Sparse Matrix Inversion with Scaled Lasso

Tingni Sun, Cun-Hui Zhang

We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other $\ell_1$ regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the $\ell_1$ and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso.

MLJan 16, 2012
A General Framework of Dual Certificate Analysis for Structured Sparse Recovery Problems

Cun-Hui Zhang, Tong Zhang

This paper develops a general theoretical framework to analyze structured sparse recovery problems using the notation of dual certificate. Although certain aspects of the dual certificate idea have already been used in some previous work, due to the lack of a general and coherent theory, the analysis has so far only been carried out in limited scopes for specific problems. In this context the current paper makes two contributions. First, we introduce a general definition of dual certificate, which we then use to develop a unified theory of sparse recovery analysis for convex programming. Second, we present a class of structured sparsity regularization called structured Lasso for which calculations can be readily performed under our theoretical framework. This new theory includes many seemingly loosely related previous work as special cases; it also implies new results that improve existing ones even for standard formulations such as L1 regularization.