Anru R. Zhang

ST
h-index17
34papers
1,087citations
Novelty55%
AI Score58

34 Papers

LGSep 22, 2022
Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions

Sitan Chen, Sinho Chewi, Jerry Li et al.

We provide theoretical convergence guarantees for score-based generative models (SGMs) such as denoising diffusion probabilistic models (DDPMs), which constitute the backbone of large-scale real-world generative models such as DALL$\cdot$E 2. Our main result is that, assuming accurate score estimates, such SGMs can efficiently sample from essentially any realistic data distribution. In contrast to prior works, our results (1) hold for an $L^2$-accurate score estimate (rather than $L^\infty$-accurate); (2) do not require restrictive functional inequality conditions that preclude substantial non-log-concavity; (3) scale polynomially in all relevant problem parameters; and (4) match state-of-the-art complexity guarantees for discretization of the Langevin diffusion, provided that the score error is sufficiently small. We view this as strong theoretical justification for the empirical success of SGMs. We also examine SGMs based on the critically damped Langevin diffusion (CLD). Contrary to conventional wisdom, we provide evidence that the use of the CLD does not reduce the complexity of SGMs.

STJun 17, 2022
Tensor-on-Tensor Regression: Riemannian Optimization, Over-parameterization, Statistical-computational Gap, and Their Interplay

Yuetian Luo, Anru R. Zhang

We study the tensor-on-tensor regression, where the goal is to connect tensor responses to tensor covariates with a low Tucker rank parameter tensor/matrix without the prior knowledge of its intrinsic rank. We propose the Riemannian gradient descent (RGD) and Riemannian Gauss-Newton (RGN) methods and cope with the challenge of unknown rank by studying the effect of rank over-parameterization. We provide the first convergence guarantee for the general tensor-on-tensor regression by showing that RGD and RGN respectively converge linearly and quadratically to a statistically optimal estimate in both rank correctly-parameterized and over-parameterized settings. Our theory reveals an intriguing phenomenon: Riemannian optimization methods naturally adapt to over-parameterization without modifications to their implementation. We also prove the statistical-computational gap in scalar-on-tensor regression by a direct low-degree polynomial argument. Our theory demonstrates a "blessing of statistical-computational gap" phenomenon: in a wide range of scenarios in tensor-on-tensor regression for tensors of order three or higher, the computationally required sample size matches what is needed by moderate rank over-parameterization when considering computationally feasible estimators, while there are no such benefits in the matrix settings. This shows moderate rank over-parameterization is essentially "cost-free" in terms of sample size in tensor-on-tensor regression of order three or higher. Finally, we conduct simulation studies to show the advantages of our proposed methods and to corroborate our theoretical findings.

NAAug 11, 2024
Tensor Decomposition Meets RKHS: Efficient Algorithms for Smooth and Misaligned Data

Brett W. Larsen, Tamara G. Kolda, Anru R. Zhang et al.

The canonical polyadic (CP) tensor decomposition decomposes a multidimensional data array into a sum of outer products of finite-dimensional vectors. Instead, we can replace some or all of the vectors with continuous functions (infinite-dimensional vectors) from a reproducing kernel Hilbert space (RKHS). We refer to tensors with some infinite-dimensional modes as quasitensors, and the approach of decomposing a tensor with some continuous RKHS modes is referred to as CP-HiFi (hybrid infinite and finite dimensional) tensor decomposition. An advantage of CP-HiFi is that it can enforce smoothness in the infinite dimensional modes. Further, CP-HiFi does not require the observed data to lie on a regular and finite rectangular grid and naturally incorporates misaligned data. We detail the methodology and illustrate it on a synthetic example.

MEJul 2, 2023
Mode-wise Principal Subspace Pursuit and Matrix Spiked Covariance Model

Runshi Tang, Ming Yuan, Anru R. Zhang

This paper introduces a novel framework called Mode-wise Principal Subspace Pursuit (MOP-UP) to extract hidden variations in both the row and column dimensions for matrix data. To enhance the understanding of the framework, we introduce a class of matrix-variate spiked covariance models that serve as inspiration for the development of the MOP-UP algorithm. The MOP-UP algorithm consists of two steps: Average Subspace Capture (ASC) and Alternating Projection (AP). These steps are specifically designed to capture the row-wise and column-wise dimension-reduced subspaces which contain the most informative features of the data. ASC utilizes a novel average projection operator as initialization and achieves exact recovery in the noiseless setting. We analyze the convergence and non-asymptotic error bounds of MOP-UP, introducing a blockwise matrix eigenvalue perturbation bound that proves the desired bound, where classic perturbation bounds fail. The effectiveness and practical merits of the proposed framework are demonstrated through experiments on both simulated and real datasets. Lastly, we discuss generalizations of our approach to higher-order data.

LGApr 8, 2022
Learning Polynomial Transformations

Sitan Chen, Jerry Li, Yuanzhi Li et al.

We consider the problem of learning high dimensional polynomial transformations of Gaussians. Given samples of the form $p(x)$, where $x\sim N(0, \mathrm{Id}_r)$ is hidden and $p: \mathbb{R}^r \to \mathbb{R}^d$ is a function where every output coordinate is a low-degree polynomial, the goal is to learn the distribution over $p(x)$. This problem is natural in its own right, but is also an important special case of learning deep generative models, namely pushforwards of Gaussians under two-layer neural networks with polynomial activations. Understanding the learnability of such generative models is crucial to understanding why they perform so well in practice. Our first main result is a polynomial-time algorithm for learning quadratic transformations of Gaussians in a smoothed setting. Our second main result is a polynomial-time algorithm for learning constant-degree polynomial transformations of Gaussian in a smoothed setting, when the rank of the associated tensors is small. In fact our results extend to any rotation-invariant input distribution, not just Gaussian. These are the first end-to-end guarantees for learning a pushforward under a neural network with more than one layer. Along the way, we also give the first polynomial-time algorithms with provable guarantees for tensor ring decomposition, a popular generalization of tensor decomposition that is used in practice to implicitly store large tensors.

STMar 9, 2023
Phase transition for detecting a small community in a large network

Jiashun Jin, Zheng Tracy Ke, Paxton Turner et al.

How to detect a small community in a large network is an interesting problem, including clique detection as a special case, where a naive degree-based $χ^2$-test was shown to be powerful in the presence of an Erdős-Renyi background. Using Sinkhorn's theorem, we show that the signal captured by the $χ^2$-test may be a modeling artifact, and it may disappear once we replace the Erdős-Renyi model by a broader network model. We show that the recent SgnQ test is more appropriate for such a setting. The test is optimal in detecting communities with sizes comparable to the whole network, but has never been studied for our setting, which is substantially different and more challenging. Using a degree-corrected block model (DCBM), we establish phase transitions of this testing problem concerning the size of the small community and the edge densities in small and large communities. When the size of the small community is larger than $\sqrt{n}$, the SgnQ test is optimal for it attains the computational lower bound (CLB), the information lower bound for methods allowing polynomial computation time. When the size of the small community is smaller than $\sqrt{n}$, we establish the parameter regime where the SgnQ test has full power and make some conjectures of the CLB. We also study the classical information lower bound (LB) and show that there is always a gap between the CLB and LB in our range of interest.

LGMay 8Code
Coupling Models for One-Step Discrete Generation

Fred Zhangzhi Peng, Avishek Joey Bose, Anru R. Zhang et al.

Generative modeling over discrete structures underpins applications across deep learning, from biological sequence design and code generation to large language models, yet generation often remains sequential, relying on autoregressive decoding or iterative refinement. In this work, we introduce Coupling Models(Coupling Models), a one-step discrete generative model that learns a direct coupling between discrete sequences and Gaussian latents. Unlike recent distillation methods that compress a pretrained multi-step sampler into a few steps, Coupling Model trains a purpose-built decoder to invert this coupling and generate samples in a single step. The model also avoids complex continuous flows over the simplex and hand-specified data-to-noise couplings. Empirically,Coupling Model improves the strongest one-step baselines in each domain: it reduces LM1B text-generation perplexity by 33% at its lowest-perplexity operating point, Fly Brain enhancer-design FBD by 18%, and MNIST-Binary FID by 46%. These results suggest that effective one-step discrete generation depends strongly on how data and noise are coupled before decoding. Code is available at https://github.com/pengzhangzhi/Coupling-Models.

LGMay 7Code
Don't Retrain, Align: Adapting Autoregressive LMs to Diffusion LMs via Representation Alignment

Fred Zhangzhi Peng, Alexis Fox, Anru R. Zhang et al.

Diffusion language models (DLMs) have recently demonstrated capabilities that complement standard autoregressive (AR) models, particularly in non-sequential generation and bidirectional editing. Although recent work has shown that pretrained autoregressive checkpoints can be converted into diffusion language models, existing recipes primarily transfer parameters through continued denoising training with objective- and attention-level modifications. We instead ask whether the internal representation geometry learned by next-token prediction can be explicitly preserved during AR-to-DLM conversion. We hypothesize that much of the semantic structure learned by AR pretraining can transfer across generation orders, and thus DLM training should be viewed as relearning the decoding path rather than relearning language representations. To investigate this, we introduce REPR-ALIGN, a representation alignment objective that adapts a bidirectional masked diffusion model to reuse representations from a pretrained AR model of identical architecture. Concretely, we align the hidden states of the DLM to the frozen AR model at every layer using cosine similarity, while optimizing the standard masked denoising objective. This simple alignment, with no adapters and no architectural changes beyond the attention mask, yields up to 4x training acceleration in our setting and is particularly effective in low-data regimes. Our results suggest that linguistic representations can transfer across generation order, and that representation alignment provides a simple and effective technique for training diffusion language models. Code is available at https://github.com/pengzhangzhi/Open-dLLM.

CVSep 26, 2022
Enhancing convolutional neural network generalizability via low-rank weight approximation

Chenyin Gao, Shu Yang, Anru R. Zhang

Noise is ubiquitous during image acquisition. Sufficient denoising is often an important first step for image processing. In recent decades, deep neural networks (DNNs) have been widely used for image denoising. Most DNN-based image denoising methods require a large-scale dataset or focus on supervised settings, in which single/pairs of clean images or a set of noisy images are required. This poses a significant burden on the image acquisition process. Moreover, denoisers trained on datasets of limited scale may incur over-fitting. To mitigate these issues, we introduce a new self-supervised framework for image denoising based on the Tucker low-rank tensor approximation. With the proposed design, we are able to characterize our denoiser with fewer parameters and train it based on a single image, which considerably improves the model's generalizability and reduces the cost of data acquisition. Extensive experiments on both synthetic and real-world noisy images have been conducted. Empirical results show that our proposed method outperforms existing non-learning-based methods (e.g., low-pass filter, non-local mean), single-image unsupervised denoisers (e.g., DIP, NN+BM3D) evaluated on both in-sample and out-sample datasets. The proposed method even achieves comparable performances with some supervised methods (e.g., DnCNN).

LGOct 23, 2023
Reliable Generation of Privacy-preserving Synthetic Electronic Health Record Time Series via Diffusion Models

Muhang Tian, Bernie Chen, Allan Guo et al.

Electronic Health Records (EHRs) are rich sources of patient-level data, offering valuable resources for medical data analysis. However, privacy concerns often restrict access to EHRs, hindering downstream analysis. Current EHR de-identification methods are flawed and can lead to potential privacy leakage. Additionally, existing publicly available EHR databases are limited, preventing the advancement of medical research using EHR. This study aims to overcome these challenges by generating realistic and privacy-preserving synthetic electronic health records (EHRs) time series efficiently. We introduce a new method for generating diverse and realistic synthetic EHR time series data using Denoising Diffusion Probabilistic Models (DDPM). We conducted experiments on six databases: Medical Information Mart for Intensive Care III and IV (MIMIC-III/IV), the eICU Collaborative Research Database (eICU), and non-EHR datasets on Stocks and Energy. We compared our proposed method with eight existing methods. Our results demonstrate that our approach significantly outperforms all existing methods in terms of data fidelity while requiring less training effort. Additionally, data generated by our method yields a lower discriminative accuracy compared to other baseline methods, indicating the proposed method can generate data with less privacy risk. The proposed diffusion-model-based method can reliably and efficiently generate synthetic EHR time series, which facilitates the downstream medical data analysis. Our numerical results show the superiority of the proposed method over all other existing methods.

DBNov 2, 2025
Reliable Curation of EHR Dataset via Large Language Models under Environmental Constraints

Raymond M. Xiong, Panyu Chen, Tianze Dong et al.

Electronic health records (EHRs) are central to modern healthcare delivery and research; yet, many researchers lack the database expertise necessary to write complex SQL queries or generate effective visualizations, limiting efficient data use and scientific discovery. To address this barrier, we introduce CELEC, a large language model (LLM)-powered framework for automated EHR data extraction and analytics. CELEC translates natural language queries into SQL using a prompting strategy that integrates schema information, few-shot demonstrations, and chain-of-thought reasoning, which together improve accuracy and robustness. On a subset of the EHRSQL benchmark, CELEC achieves execution accuracy comparable to prior systems while maintaining low latency, cost efficiency, and strict privacy by exposing only database metadata to the LLM. CELEC also adheres to strict privacy protocols: the LLM accesses only database metadata (e.g., table and column names), while all query execution occurs securely within the institutional environment, ensuring that no patient-level data is ever transmitted to or shared with the LLM. Ablation studies confirm that each component of the SQL generation pipeline, particularly the few-shot demonstrations, plays a critical role in performance. By lowering technical barriers and enabling medical researchers to query EHR databases directly, CELEC streamlines research workflows and accelerates biomedical discovery.

MLJan 22
Synthetic Augmentation in Imbalanced Learning: When It Helps, When It Hurts, and How Much to Add

Zhengchi Ma, Anru R. Zhang

Imbalanced classification often causes standard training procedures to prioritize the majority class and perform poorly on rare but important cases. A classic and widely used remedy is to augment the minority class with synthetic samples, but two basic questions remain under-resolved: when does synthetic augmentation actually help, and how many synthetic samples should be generated? We develop a unified statistical framework for synthetic augmentation in imbalanced learning, studying models trained on imbalanced data augmented with synthetic minority samples. Our theory shows that synthetic data is not always beneficial. In a "local symmetry" regime, imbalance is not the dominant source of error, so adding synthetic samples cannot improve learning rates and can even degrade performance by amplifying generator mismatch. When augmentation can help ("local asymmetry"), the optimal synthetic size depends on generator accuracy and on whether the generator's residual mismatch is directionally aligned with the intrinsic majority-minority shift. This structure can make the best synthetic size deviate from naive full balancing. Practically, we recommend Validation-Tuned Synthetic Size (VTSS): select the synthetic size by minimizing balanced validation loss over a range centered near the fully balanced baseline, while allowing meaningful departures. Extensive simulations and real data analysis further support our findings.

MLOct 30, 2025
Bias-Corrected Data Synthesis for Imbalanced Learning

Pengfei Lyu, Zhengchi Ma, Linjun Zhang et al.

Imbalanced data, where the positive samples represent only a small proportion compared to the negative samples, makes it challenging for classification problems to balance the false positive and false negative rates. A common approach to addressing the challenge involves generating synthetic data for the minority group and then training classification models with both observed and synthetic data. However, since the synthetic data depends on the observed data and fails to replicate the original data distribution accurately, prediction accuracy is reduced when the synthetic data is naively treated as the true data. In this paper, we address the bias introduced by synthetic data and provide consistent estimators for this bias by borrowing information from the majority group. We propose a bias correction procedure to mitigate the adverse effects of synthetic data, enhancing prediction accuracy while avoiding overfitting. This procedure is extended to broader scenarios with imbalanced data, such as imbalanced multi-task learning and causal inference. Theoretical properties, including bounds on bias estimation errors and improvements in prediction accuracy, are provided. Simulation results and data analysis on handwritten digit datasets demonstrate the effectiveness of our method.

STMar 27
Detection Is Harder Than Estimation in Certain Regimes: Inference for Moment and Cumulant Tensors

Runshi Tang, Yuefeng Han, Anru R. Zhang

We study estimation and detection of high-order moment and cumulant tensors from $n$ i.i.d. observations of a $p$-dimensional random vector, with performance measured in tensor spectral norm. On the statistical side, we prove that under sub-Gaussianity, the minimax rate for estimating the order-$d$ moment and cumulant tensors is $\sqrt{p/n}\wedge 1$. In contrast to covariance estimation, the sample moment tensor is generally no longer rate-optimal for higher-order moments. We therefore develop an estimator that attains the minimax rate up to logarithmic factors through a convex feasibility formulation over an $\varepsilon$-net of the unit sphere. On the computational side, we study the problem of testing whether the $d$-th order cumulant tensor vanishes after whitening. Using the low-degree polynomial framework, we provide evidence that detection is computationally hard when $n\ll p^{d/2}$. At the same time, we identify a regime in which an efficiently computable estimator attains error smaller than the separation at which low-degree tests can reliably distinguish the null from the alternative. This yields the striking conclusion that computationally efficient detection can be harder than computationally efficient estimation, revealing an unusual reverse detection-estimation gap: in a broad regime, computationally efficient estimation is possible at a smaller scale than computationally efficient detection. This phenomenon arises because the computational difficulty is driven not only by the statistical model, but also by the loss function itself: tensor spectral norm is NP-hard to compute. This feature makes the proposed open problems regarding computational lower bounds for estimation qualitatively different from the existing literature. Our results therefore uncover a new kind of computational--statistical gap.

MLOct 17, 2024
Tensor Decomposition with Unaligned Observations

Runshi Tang, Tamara Kolda, Anru R. Zhang

This paper presents a canonical polyadic (CP) tensor decomposition that addresses unaligned observations. The mode with unaligned observations is represented using functions in a reproducing kernel Hilbert space (RKHS). We introduce a versatile loss function that effectively accounts for various types of data, including binary, integer-valued, and positive-valued types. Additionally, we propose an optimization algorithm for computing tensor decompositions with unaligned observations, along with a stochastic gradient method to enhance computational efficiency. A sketching algorithm is also introduced to further improve efficiency when using the $\ell_2$ loss function. To demonstrate the efficacy of our methods, we provide illustrative examples using both synthetic data and an early childhood human microbiome dataset.

LGOct 10, 2025
Why Do Transformers Fail to Forecast Time Series In-Context?

Yufa Zhou, Yixiao Wang, Surbhi Goel et al.

Time series forecasting (TSF) remains a challenging and largely unsolved problem in machine learning, despite significant recent efforts leveraging Large Language Models (LLMs), which predominantly rely on Transformer architectures. Empirical evidence consistently shows that even powerful Transformers often fail to outperform much simpler models, e.g., linear models, on TSF tasks; however, a rigorous theoretical understanding of this phenomenon remains limited. In this paper, we provide a theoretical analysis of Transformers' limitations for TSF through the lens of In-Context Learning (ICL) theory. Specifically, under AR($p$) data, we establish that: (1) Linear Self-Attention (LSA) models $\textit{cannot}$ achieve lower expected MSE than classical linear models for in-context forecasting; (2) as the context length approaches to infinity, LSA asymptotically recovers the optimal linear predictor; and (3) under Chain-of-Thought (CoT) style inference, predictions collapse to the mean exponentially. We empirically validate these findings through carefully designed experiments. Our theory not only sheds light on several previously underexplored phenomena but also offers practical insights for designing more effective forecasting architectures. We hope our work encourages the broader research community to revisit the fundamental theoretical limitations of TSF and to critically evaluate the direct application of increasingly sophisticated architectures without deeper scrutiny.

STApr 1
A General Framework for Computational Lower Bounds in Nontrivial Norm Approximation

Runshi Tang, Yuefeng Han, Anru R. Zhang

In this note, we propose a general framework for proving computational lower bounds in norm approximation by leveraging a reverse detection--estimation gap. The starting point is a testing problem together with an estimator whose error is significantly smaller than the corresponding computational detection threshold. We show that such a gap yields a lower bound on the approximation distortion achievable by any algorithm in the underlying computational class. In this way, reverse detection--estimation gaps can be turned into a general mechanism for certifying the hardness of approximating nontrivial norms. We apply this framework to the spectral norm of order-$d$ symmetric tensors in $\mathbb{R}^{p^d}$. Using a recently established low-degree hardness result for detecting nonzero high-order cumulant tensors, together with an efficiently computable estimator whose error is below the low-degree detection threshold, we prove that any degree-$D$ low-degree algorithm with $D \le c_d(\log p)^2$ must incur distortion at least $p^{d/4-1/2}/\operatorname{polylog}(p)$ for the tensor spectral norm. Under the low-degree conjecture, the same conclusion extends to all polynomial-time algorithms. In several important settings, this lower bound matches the best known upper bounds up to polylogarithmic factors, suggesting that the exponent $d/4-1/2$ captures a genuine computational barrier. Our results provide evidence that the difficulty of approximating tensor spectral norm is not merely an artifact of existing techniques, but reflects a broader computational barrier.

AIOct 10, 2025
The Geometry of Reasoning: Flowing Logics in Representation Space

Yufa Zhou, Yixiao Wang, Xunjian Yin et al.

We study how large language models (LLMs) ``think'' through their representation space. We propose a novel geometric framework that models an LLM's reasoning as flows -- embedding trajectories evolving where logic goes. We disentangle logical structure from semantics by employing the same natural deduction propositions with varied semantic carriers, allowing us to test whether LLMs internalize logic beyond surface form. This perspective connects reasoning with geometric quantities such as position, velocity, and curvature, enabling formal analysis in representation and concept spaces. Our theory establishes: (1) LLM reasoning corresponds to smooth flows in representation space, and (2) logical statements act as local controllers of these flows' velocities. Using learned representation proxies, we design controlled experiments to visualize and quantify reasoning flows, providing empirical validation of our theoretical framework. Our work serves as both a conceptual foundation and practical tools for studying reasoning phenomenon, offering a new lens for interpretability and formal analysis of LLMs' behavior.

LGSep 27, 2025
Planner Aware Path Learning in Diffusion Language Models Training

Fred Zhangzhi Peng, Zachary Bezemek, Jarrid Rector-Brooks et al.

Diffusion language models have emerged as a powerful alternative to autoregressive models, enabling fast inference through flexible and parallel generation paths. This flexibility is enabled by new sampling strategies, or planners, that iteratively choose where to denoise along the sequence rather than sampling uniformly at random. However, by modifying reverse paths, planners introduce a mismatch between the uniformly random denoising paths used during training and the planning-based paths used at inference. In this work, we systematically investigate this mismatch and theoretically show that the standard discrete diffusion training evidence lower bound (ELBO) does not accurately describe a denoiser under non-uniform planning. To bridge this gap, we derive a new Planned Evidence Lower Bound (P-ELBO) that directly incorporates planner-based reverse dynamics into the training objective. Building on this, we propose Planner Aware Path Learning (PAPL), a simple and effective modification of the standard masked discrete diffusion loss that aligns training and inference under planned denoisers. Empirically, PAPL delivers consistent improvements across domains, including a 40% relative gain in protein sequence modeling, up to a 4x improvement in MAUVE for text generation, and a 23% relative gain in HumanEval pass@10 for code generation.

MLAug 19, 2025
Smooth Flow Matching

Jianbin Tan, Anru R. Zhang

Functional data, i.e., smooth random functions observed over a continuous domain, are increasingly available in areas such as biomedical research, health informatics, and epidemiology. However, effective statistical analysis for functional data is often hindered by challenges such as privacy constraints, sparse and irregular sampling, infinite dimensionality, and non-Gaussian structures. To address these challenges, we introduce a novel framework named Smooth Flow Matching (SFM), tailored for generative modeling of functional data to enable statistical analysis without exposing sensitive real data. Built upon flow-matching ideas, SFM constructs a semiparametric copula flow to generate infinite-dimensional functional data, free from Gaussianity or low-rank assumptions. It is computationally efficient, handles irregular observations, and guarantees the smoothness of the generated functions, offering a practical and flexible solution in scenarios where existing deep generative methods are not applicable. Through extensive simulation studies, we demonstrate the advantages of SFM in terms of both synthetic data quality and computational efficiency. We then apply SFM to generate clinical trajectory data from the MIMIC-IV patient electronic health records (EHR) longitudinal database. Our analysis showcases the ability of SFM to produce high-quality surrogate data for downstream statistical tasks, highlighting its potential to boost the utility of EHR data for clinical applications.

OCOct 23, 2021
On Geometric Connections of Embedded and Quotient Geometries in Riemannian Fixed-rank Matrix Optimization

Yuetian Luo, Xudong Li, Anru R. Zhang

In this paper, we propose a general procedure for establishing the geometric landscape connections of a Riemannian optimization problem under the embedded and quotient geometries. By applying the general procedure to the fixed-rank positive semidefinite (PSD) and general matrix optimization, we establish an exact Riemannian gradient connection under two geometries at every point on the manifold and sandwich inequalities between the spectra of Riemannian Hessians at Riemannian first-order stationary points (FOSPs). These results immediately imply an equivalence on the sets of Riemannian FOSPs, Riemannian second-order stationary points (SOSPs), and strict saddles of fixed-rank matrix optimization under the embedded and the quotient geometries. To the best of our knowledge, this is the first geometric landscape connection between the embedded and the quotient geometries for fixed-rank matrix optimization and it provides a concrete example of how these two geometries are connected in Riemannian optimization. In addition, the effects of the Riemannian metric and quotient structure on the landscape connection are discussed. We also observe an algorithmic connection between two geometries with some specific Riemannian metrics in fixed-rank matrix optimization: there is an equivalence between gradient flows under two geometries with shared spectra of Riemannian Hessians. A number of novel ideas and technical ingredients including a unified treatment for different Riemannian metrics, novel metrics for the Stiefel manifold, and new horizontal space representations under quotient geometries are developed to obtain our results. The results in this paper deepen our understanding of geometric and algorithmic connections of Riemannian optimization under different Riemannian geometries and provide a few new theoretical insights to unanswered questions in the literature.

OCAug 3, 2021
Nonconvex Factorization and Manifold Formulations are Almost Equivalent in Low-rank Matrix Optimization

Yuetian Luo, Xudong Li, Anru R. Zhang

In this paper, we consider the geometric landscape connection of the widely studied manifold and factorization formulations in low-rank positive semidefinite (PSD) and general matrix optimization. We establish a sandwich relation on the spectrum of Riemannian and Euclidean Hessians at first-order stationary points (FOSPs). As a result of that, we obtain an equivalence on the set of FOSPs, second-order stationary points (SOSPs) and strict saddles between the manifold and the factorization formulations. In addition, we show the sandwich relation can be used to transfer more quantitative geometric properties from one formulation to another. Similarities and differences in the landscape connection under the PSD case and the general case are discussed. To the best of our knowledge, this is the first geometric landscape connection between the manifold and the factorization formulations for handling rank constraints, and it provides a geometric explanation for the similar empirical performance of factorization and manifold approaches in low-rank matrix optimization observed in the literature. In the general low-rank matrix optimization, the landscape connection of two factorization formulations (unregularized and regularized ones) is also provided. By applying these geometric landscape connections, in particular, the sandwich relation, we are able to solve unanswered questions in literature and establish stronger results in the applications on geometric analysis of phase retrieval, well-conditioned low-rank matrix optimization, and the role of regularization in factorization arising from machine learning and signal processing.

MLApr 24, 2021
Low-rank Tensor Estimation via Riemannian Gauss-Newton: Statistical Optimality and Second-Order Convergence

Yuetian Luo, Anru R. Zhang

In this paper, we consider the estimation of a low Tucker rank tensor from a number of noisy linear measurements. The general problem covers many specific examples arising from applications, including tensor regression, tensor completion, and tensor PCA/SVD. We consider an efficient Riemannian Gauss-Newton (RGN) method for low Tucker rank tensor estimation. Different from the generic (super)linear convergence guarantee of RGN in the literature, we prove the first local quadratic convergence guarantee of RGN for low-rank tensor estimation in the noisy setting under some regularity conditions and provide the corresponding estimation error upper bounds. A deterministic estimation error lower bound, which matches the upper bound, is provided that demonstrates the statistical optimality of RGN. The merit of RGN is illustrated through two machine learning applications: tensor regression and tensor SVD. Finally, we provide the simulation results to corroborate our theoretical findings.

STDec 29, 2020
Inference for Low-rank Tensors -- No Need to Debias

Dong Xia, Anru R. Zhang, Yuchen Zhou

In this paper, we consider the statistical inference for several low-rank tensor models. Specifically, in the Tucker low-rank tensor PCA or regression model, provided with any estimates achieving some attainable error rate, we develop the data-driven confidence regions for the singular subspace of the parameter tensor based on the asymptotic distribution of an updated estimate by two-iteration alternating minimization. The asymptotic distributions are established under some essential conditions on the signal-to-noise ratio (in PCA model) or sample size (in regression model). If the parameter tensor is further orthogonally decomposable, we develop the methods and non-asymptotic theory for inference on each individual singular vector. For the rank-one tensor PCA model, we establish the asymptotic distribution for general linear forms of principal components and confidence interval for each entry of the parameter tensor. Finally, numerical simulations are presented to corroborate our theoretical discoveries. In all these models, we observe that different from many matrix/vector settings in existing work, debiasing is not required to establish the asymptotic distribution of estimates or to make statistical inference on low-rank tensors. In fact, due to the widely observed statistical-computational-gap for low-rank tensor estimation, one usually requires stronger conditions than the statistical (or information-theoretic) limit to ensure the computationally feasible estimation is achievable. Surprisingly, such conditions ``incidentally" render a feasible low-rank tensor inference without debiasing.

MEDec 18, 2020
Exact Clustering in Tensor Block Model: Statistical Optimality and Computational Limit

Rungang Han, Yuetian Luo, Miaoyan Wang et al.

High-order clustering aims to identify heterogeneous substructures in multiway datasets that arise commonly in neuroimaging, genomics, social network studies, etc. The non-convex and discontinuous nature of this problem pose significant challenges in both statistics and computation. In this paper, we propose a tensor block model and the computationally efficient methods, \emph{high-order Lloyd algorithm} (HLloyd), and high-order spectral clustering (HSC), for high-order clustering. The convergence guarantees and statistical optimality are established for the proposed procedure under a mild sub-Gaussian noise assumption. Under the Gaussian tensor block model, we completely characterize the statistical-computational trade-off for achieving high-order exact clustering based on three different signal-to-noise ratio regimes. The analysis relies on new techniques of high-order spectral perturbation analysis and a ``singular-value-gap-free'' error bound in tensor estimation, which are substantially different from the matrix spectral analyses in the literature. Finally, we show the merits of the proposed procedures via extensive experiments on both synthetic and real datasets.

OCNov 17, 2020
Recursive Importance Sketching for Rank Constrained Least Squares: Algorithms and High-order Convergence

Yuetian Luo, Wen Huang, Xudong Li et al.

In this paper, we propose {\it \underline{R}ecursive} {\it \underline{I}mportance} {\it \underline{S}ketching} algorithm for {\it \underline{R}ank} constrained least squares {\it \underline{O}ptimization} (RISRO). The key step of RISRO is recursive importance sketching, a new sketching framework based on deterministically designed recursive projections, which significantly differs from the randomized sketching in the literature \citep{mahoney2011randomized,woodruff2014sketching}. Several existing algorithms in the literature can be reinterpreted under this new sketching framework and RISRO offers clear advantages over them. RISRO is easy to implement and computationally efficient, where the core procedure in each iteration is to solve a dimension-reduced least squares problem. We establish the local quadratic-linear and quadratic rate of convergence for RISRO under some mild conditions. We also discover a deep connection of RISRO to the Riemannian Gauss-Newton algorithm on fixed rank matrices. The effectiveness of RISRO is demonstrated in two applications in machine learning and statistics: low-rank matrix trace regression and phase retrieval. Simulation studies demonstrate the superior numerical performance of RISRO.

STOct 6, 2020
Optimal High-order Tensor SVD via Tensor-Train Orthogonal Iteration

Yuchen Zhou, Anru R. Zhang, Lili Zheng et al.

This paper studies a general framework for high-order tensor SVD. We propose a new computationally efficient algorithm, tensor-train orthogonal iteration (TTOI), that aims to estimate the low tensor-train rank structure from the noisy high-order tensor observation. The proposed TTOI consists of initialization via TT-SVD (Oseledets, 2011) and new iterative backward/forward updates. We develop the general upper bound on estimation error for TTOI with the support of several new representation lemmas on tensor matricizations. By developing a matching information-theoretic lower bound, we also prove that TTOI achieves the minimax optimality under the spiked tensor model. The merits of the proposed TTOI are illustrated through applications to estimation and dimension reduction of high-order Markov processes, numerical studies, and a real data example on New York City taxi travel records. The software of the proposed algorithm is available online$^6$.

MLSep 12, 2020
Open Problem: Average-Case Hardness of Hypergraphic Planted Clique Detection

Yuetian Luo, Anru R. Zhang

We note the significance of hypergraphic planted clique (HPC) detection in the investigation of computational hardness for a range of tensor problems. We ask if more evidence for the computational hardness of HPC detection can be developed. In particular, we conjecture if it is possible to establish the equivalence of the computational hardness between HPC and PC detection.

STAug 6, 2020
A Sharp Blockwise Tensor Perturbation Bound for Orthogonal Iteration

Yuetian Luo, Garvesh Raskutti, Ming Yuan et al.

In this paper, we develop novel perturbation bounds for the high-order orthogonal iteration (HOOI) [DLDMV00b]. Under mild regularity conditions, we establish blockwise tensor perturbation bounds for HOOI with guarantees for both tensor reconstruction in Hilbert-Schmidt norm $\|\widehat{\bcT} - \bcT \|_{\tHS}$ and mode-$k$ singular subspace estimation in Schatten-$q$ norm $\| \sin Θ(\widehat{\U}_k, \U_k) \|_q$ for any $q \geq 1$. We show the upper bounds of mode-$k$ singular subspace estimation are unilateral and converge linearly to a quantity characterized by blockwise errors of the perturbation and signal strength. For the tensor reconstruction error bound, we express the bound through a simple quantity $ξ$, which depends only on perturbation and the multilinear rank of the underlying signal. Rate matching deterministic lower bound for tensor reconstruction, which demonstrates the optimality of HOOI, is also provided. Furthermore, we prove that one-step HOOI (i.e., HOOI with only a single iteration) is also optimal in terms of tensor reconstruction and can be used to lower the computational cost. The perturbation results are also extended to the case that only partial modes of $\bcT$ have low-rank structure. We support our theoretical results by extensive numerical studies. Finally, we apply the novel perturbation bounds of HOOI on two applications, tensor denoising and tensor co-clustering, from machine learning and statistics, which demonstrates the superiority of the new perturbation results.

STMay 21, 2020
Tensor Clustering with Planted Structures: Statistical Optimality and Computational Limits

Yuetian Luo, Anru R. Zhang

This paper studies the statistical and computational limits of high-order clustering with planted structures. We focus on two clustering models, constant high-order clustering (CHC) and rank-one higher-order clustering (ROHC), and study the methods and theory for testing whether a cluster exists (detection) and identifying the support of cluster (recovery). Specifically, we identify the sharp boundaries of signal-to-noise ratio for which CHC and ROHC detection/recovery are statistically possible. We also develop the tight computational thresholds: when the signal-to-noise ratio is below these thresholds, we prove that polynomial-time algorithms cannot solve these problems under the computational hardness conjectures of hypergraphic planted clique (HPC) detection and hypergraphic planted dense subgraph (HPDS) recovery. We also propose polynomial-time tensor algorithms that achieve reliable detection and recovery when the signal-to-noise ratio is above these thresholds. Both sparsity and tensor structures yield the computational barriers in high-order tensor clustering. The interplay between them results in significant differences between high-order tensor clustering and matrix clustering in literature in aspects of statistical and computational phase transition diagrams, algorithmic approaches, hardness conjecture, and proof techniques. To our best knowledge, we are the first to give a thorough characterization of the statistical and computational trade-off for such a double computational-barrier problem. Finally, we provide evidence for the computational hardness conjectures of HPC detection (via low-degree polynomial and Metropolis methods) and HPDS recovery (via low-degree polynomial method).

STFeb 26, 2020
An Optimal Statistical and Computational Framework for Generalized Tensor Estimation

Rungang Han, Rebecca Willett, Anru R. Zhang

This paper describes a flexible framework for generalized low-rank tensor estimation problems that includes many important instances arising from applications in computational imaging, genomics, and network analysis. The proposed estimator consists of finding a low-rank tensor fit to the data under generalized parametric models. To overcome the difficulty of non-convexity in these problems, we introduce a unified approach of projected gradient descent that adapts to the underlying low-rank structure. Under mild conditions on the loss function, we establish both an upper bound on statistical error and the linear rate of computational convergence through a general deterministic analysis. Then we further consider a suite of generalized tensor estimation problems, including sub-Gaussian tensor PCA, tensor regression, and Poisson and binomial tensor PCA. We prove that the proposed algorithm achieves the minimax optimal rate of convergence in estimation error. Finally, we demonstrate the superiority of the proposed framework via extensive experiments on both simulated and real data.

STSep 21, 2019
Sparse Group Lasso: Optimal Sample Complexity, Convergence Rate, and Statistical Inference

T. Tony Cai, Anru R. Zhang, Yuchen Zhou

We study sparse group Lasso for high-dimensional double sparse linear regression, where the parameter of interest is simultaneously element-wise and group-wise sparse. This problem is an important instance of the simultaneously structured model -- an actively studied topic in statistics and machine learning. In the noiseless case, matching upper and lower bounds on sample complexity are established for the exact recovery of sparse vectors and for stable estimation of approximately sparse vectors, respectively. In the noisy case, upper and matching minimax lower bounds for estimation error are obtained. We also consider the debiased sparse group Lasso and investigate its asymptotic property for the purpose of statistical inference. Finally, numerical studies are provided to support the theoretical results.

PROct 21, 2018
On the Non-asymptotic and Sharp Lower Tail Bounds of Random Variables

Anru R. Zhang, Yuchen Zhou

The non-asymptotic tail bounds of random variables play crucial roles in probability, statistics, and machine learning. Despite much success in developing upper bounds on tail probability in literature, the lower bounds on tail probabilities are relatively fewer. In this paper, we introduce systematic and user-friendly schemes for developing non-asymptotic lower bounds of tail probabilities. In addition, we develop sharp lower tail bounds for the sum of independent sub-Gaussian and sub-exponential random variables, which match the classic Hoeffding-type and Bernstein-type concentration inequalities, respectively. We also provide non-asymptotic matching upper and lower tail bounds for a suite of distributions, including gamma, beta, (regular, weighted, and noncentral) chi-square, binomial, Poisson, Irwin-Hall, etc. We apply the result to establish the matching upper and lower bounds for extreme value expectation of the sum of independent sub-Gaussian and sub-exponential random variables. A statistical application of signal identification from sparse heterogeneous mixtures is finally considered.

STOct 19, 2018
Heteroskedastic PCA: Algorithm, Optimality, and Applications

Anru R. Zhang, T. Tony Cai, Yihong Wu

A general framework for principal component analysis (PCA) in the presence of heteroskedastic noise is introduced. We propose an algorithm called HeteroPCA, which involves iteratively imputing the diagonal entries of the sample covariance matrix to remove estimation bias due to heteroskedasticity. This procedure is computationally efficient and provably optimal under the generalized spiked covariance model. A key technical step is a deterministic robust perturbation analysis on singular subspaces, which can be of independent interest. The effectiveness of the proposed algorithm is demonstrated in a suite of problems in high-dimensional statistics, including singular value decomposition (SVD) under heteroskedastic noise, Poisson PCA, and SVD for heteroskedastic and incomplete data.