Garvesh Raskutti

ML
h-index29
28papers
980citations
Novelty52%
AI Score53

28 Papers

MLJul 19, 2022
Lazy Estimation of Variable Importance for Large Neural Networks

Yue Gao, Abby Stevens, Rebecca Willet et al.

As opaque predictive models increasingly impact many areas of modern life, interest in quantifying the importance of a given input variable for making a specific prediction has grown. Recently, there has been a proliferation of model-agnostic methods to measure variable importance (VI) that analyze the difference in predictive power between a full model trained on all variables and a reduced model that excludes the variable(s) of interest. A bottleneck common to these methods is the estimation of the reduced model for each variable (or subset of variables), which is an expensive process that often does not come with theoretical guarantees. In this work, we propose a fast and flexible method for approximating the reduced model with important inferential guarantees. We replace the need for fully retraining a wide neural network by a linearization initialized at the full model parameters. By adding a ridge-like penalty to make the problem convex, we prove that when the ridge penalty parameter is sufficiently large, our method estimates the variable importance measure with an error rate of $O(\frac{1}{\sqrt{n}})$ where $n$ is the number of training samples. We also show that our estimator is asymptotically normal, enabling us to provide confidence bounds for the VI estimates. We demonstrate through simulations that our method is fast and accurate under several data-generating regimes, and we demonstrate its real-world applicability on a seasonal climate forecasting example.

68.0MLApr 16
MinShap: A Modified Shapley Value Approach for Feature Selection

Chenghui Zheng, Garvesh Raskutti

Feature selection is a classical problem in statistics and machine learning, and it continues to remain an extremely challenging problem especially in the context of unknown non-linear relationships with dependent features. On the other hand, Shapley values are a classic solution concept from cooperative game theory that is widely used for feature attribution in general non-linear models with highly-dependent features. However, Shapley values are not naturally suited for feature selection since they tend to capture both direct effects from each feature to the response and indirect effects through other features. In this paper, we combine the advantages of Shapley values and adapt them to feature selection by proposing \emph{MinShap}, a modification of the Shapley value framework along with a suite of other related algorithms. In particular for MinShap, instead of taking the average marginal contributions over permutations of features, considers the minimum marginal contribution across permutations. We provide a theoretical foundation motivated by the faithfulness assumption in DAG (directed acyclic graphical models), a guarantee for the Type I error of MinShap, and show through numerical simulations and real data experiments that MinShap tends to outperform state-of-the-art feature selection algorithms such as LOCO, GCM and Lasso in terms of both accuracy and stability. We also introduce a suite of algorithms related to MinShap by using the multiple testing/p-value perspective that improves performance in lower-sample settings and provide supporting theoretical guarantees.

MLJun 11, 2023
Fast, Distribution-free Predictive Inference for Neural Networks with Coverage Guarantees

Yue Gao, Garvesh Raskutti, Rebecca Willet

This paper introduces a novel, computationally-efficient algorithm for predictive inference (PI) that requires no distributional assumptions on the data and can be computed faster than existing bootstrap-type methods for neural networks. Specifically, if there are $n$ training samples, bootstrap methods require training a model on each of the $n$ subsamples of size $n-1$; for large models like neural networks, this process can be computationally prohibitive. In contrast, our proposed method trains one neural network on the full dataset with $(ε, δ)$-differential privacy (DP) and then approximates each leave-one-out model efficiently using a linear approximation around the differentially-private neural network estimate. With exchangeable data, we prove that our approach has a rigorous coverage guarantee that depends on the preset privacy parameters and the stability of the neural network, regardless of the data distribution. Simulations and experiments on real data demonstrate that our method satisfies the coverage guarantees with substantially reduced computation compared to bootstrap methods.

23.9LGMay 6
Online Localized Conformal Prediction

Yuheng Lai, Garvesh Raskutti

Conformal prediction is a framework that provides valid uncertainty quantification for general models with exchangeable data. However, in the online learning and time-series settings, exchangeability is not satisfied. Existing online conformal methods, such as adaptive conformal inference (ACI), can achieve long-run validity, yet they remain inefficient under covariate heterogeneity because they rely on global calibration. We propose \emph{Online Localized Conformal Prediction (OLCP)}, which combines online adaptation with covariate-dependent localization to better reflect heterogeneity. To reduce sensitivity to the localization bandwidth, we further develop \emph{OLCP-Hedge}, which performs bandwidth selection as an online expert aggregation problem using a constrained online convex optimization framework. Importantly, we provide coverage guarantees for both algorithms and demonstrate through simulations and real-data experiments that the proposed methods attain valid long-run coverage with narrower prediction sets than existing baselines.

MLDec 2, 2024
Reliable and scalable variable importance estimation via warm-start and early stopping

Zexuan Sun, Garvesh Raskutti

As opaque black-box predictive models become more prevalent, the need to develop interpretations for these models is of great interest. The concept of variable importance and Shapley values are interpretability measures that applies to any predictive model and assesses how much a variable or set of variables improves prediction performance. When the number of variables is large, estimating variable importance presents a significant computational challenge because re-training neural networks or other black-box algorithms requires significant additional computation. In this paper, we address this challenge for algorithms using gradient descent and gradient boosting (e.g. neural networks, gradient-boosted decision trees). By using the ideas of early stopping of gradient-based methods in combination with warm-start using the dropout method, we develop a scalable method to estimate variable importance for any algorithm that can be expressed as an iterative kernel update equation. Importantly, we provide theoretical guarantees by using the theory for early stopping of kernel-based methods for neural networks with sufficiently large (but not necessarily infinite) width and gradient-boosting decision trees that use symmetric trees as a weaker learner. We also demonstrate the efficacy of our methods through simulations and a real data example which illustrates the computational benefit of early stopping rather than fully re-training the model as well as the increased accuracy of our approach.

MLFeb 15
A Theoretical Framework for LLM Fine-tuning Using Early Stopping for Non-random Initialization

Zexuan Sun, Garvesh Raskutti

In the era of large language models (LLMs), fine-tuning pretrained models has become ubiquitous. Yet the theoretical underpinning remains an open question. A central question is why only a few epochs of fine-tuning are typically sufficient to achieve strong performance on many different tasks. In this work, we approach this question by developing a statistical framework, combining rigorous early stopping theory with the attention-based Neural Tangent Kernel (NTK) for LLMs, offering new theoretical insights on fine-tuning practices. Specifically, we formally extend classical NTK theory [Jacot et al., 2018] to non-random (i.e., pretrained) initializations and provide a convergence guarantee for attention-based fine-tuning. One key insight provided by the theory is that the convergence rate with respect to sample size is closely linked to the eigenvalue decay rate of the empirical kernel matrix induced by the NTK. We also demonstrate how the framework can be used to explain task vectors for multiple tasks in LLMs. Finally, experiments with modern language models on real-world datasets provide empirical evidence supporting our theoretical insights.

MLAug 19, 2025
Comparing Model-agnostic Feature Selection Methods through Relative Efficiency

Chenghui Zheng, Garvesh Raskutti

Feature selection and importance estimation in a model-agnostic setting is an ongoing challenge of significant interest. Wrapper methods are commonly used because they are typically model-agnostic, even though they are computationally intensive. In this paper, we focus on feature selection methods related to the Generalized Covariance Measure (GCM) and Leave-One-Covariate-Out (LOCO) estimation, and provide a comparison based on relative efficiency. In particular, we present a theoretical comparison under three model settings: linear models, non-linear additive models, and single index models that mimic a single-layer neural network. We complement this with extensive simulations and real data examples. Our theoretical results, along with empirical findings, demonstrate that GCM-related methods generally outperform LOCO under suitable regularity conditions. Furthermore, we quantify the asymptotic relative efficiency of these approaches. Our simulations and real data analysis include widely used machine learning methods such as neural networks and gradient boosting trees.

MLNov 19, 2021
Gaussian Process Inference Using Mini-batch Stochastic Gradient Descent: Convergence Guarantees and Empirical Benefits

Hao Chen, Lili Zheng, Raed Al Kontar et al.

Stochastic gradient descent (SGD) and its variants have established themselves as the go-to algorithms for large-scale machine learning problems with independent samples due to their generalization performance and intrinsic computational advantage. However, the fact that the stochastic gradient is a biased estimator of the full gradient with correlated samples has led to the lack of theoretical understanding of how SGD behaves under correlated settings and hindered its use in such cases. In this paper, we focus on hyperparameter estimation for the Gaussian process (GP) and take a step forward towards breaking the barrier by proving minibatch SGD converges to a critical point of the full log-likelihood loss function, and recovers model hyperparameters with rate $O(\frac{1}{K})$ for $K$ iterations, up to a statistical error term depending on the minibatch size. Our theoretical guarantees hold provided that the kernel functions exhibit exponential or polynomial eigendecay which is satisfied by a wide range of kernels commonly used in GPs. Numerical studies on both simulated and real datasets demonstrate that minibatch SGD has better generalization over state-of-the-art GP methods while reducing the computational burden and opening a new, previously unexplored, data size regime for GPs.

LGNov 9, 2021
The Internet of Federated Things (IoFT): A Vision for the Future and In-depth Survey of Data-driven Approaches for Federated Learning

Raed Kontar, Naichen Shi, Xubo Yue et al.

The Internet of Things (IoT) is on the verge of a major paradigm shift. In the IoT system of the future, IoFT, the cloud will be substituted by the crowd where model training is brought to the edge, allowing IoT devices to collaboratively extract knowledge and build smart analytics/models while keeping their personal data stored locally. This paradigm shift was set into motion by the tremendous increase in computational power on IoT devices and the recent advances in decentralized and privacy-preserving model training, coined as federated learning (FL). This article provides a vision for IoFT and a systematic overview of current efforts towards realizing this vision. Specifically, we first introduce the defining characteristics of IoFT and discuss FL data-driven approaches, opportunities, and challenges that allow decentralized inference within three dimensions: (i) a global model that maximizes utility across all IoT devices, (ii) a personalized model that borrows strengths across all devices yet retains its own model, (iii) a meta-learning model that quickly adapts to new devices or learning tasks. We end by describing the vision and challenges of IoFT in reshaping different industries through the lens of domain experts. Those industries include manufacturing, transportation, energy, healthcare, quality & reliability, business, and computing.

MLJun 28, 2021
Improved Prediction and Network Estimation Using the Monotone Single Index Multi-variate Autoregressive Model

Yue Gao, Garvesh Raskutti

Network estimation from multi-variate point process or time series data is a problem of fundamental importance. Prior work has focused on parametric approaches that require a known parametric model, which makes estimation procedures less robust to model mis-specification, non-linearities and heterogeneities. In this paper, we develop a semi-parametric approach based on the monotone single-index multi-variate autoregressive model (SIMAM) which addresses these challenges. We provide theoretical guarantees for dependent data and an alternating projected gradient descent algorithm. Significantly we do not explicitly assume mixing conditions on the process (although we do require conditions analogous to restricted strong convexity) and we achieve rates of the form $O(T^{-\frac{1}{3}} \sqrt{s\log(TM)})$ (optimal in the independent design case) where $s$ is the threshold for the maximum in-degree of the network that indicates the sparsity level, $M$ is the number of actors and $T$ is the number of time points. In addition, we demonstrate the superior performance both on simulated data and two real data examples where our SIMAM approach out-performs state-of-the-art parametric methods both in terms of prediction and network estimation.

MLMar 25, 2021
Prediction in the presence of response-dependent missing labels

Hyebin Song, Garvesh Raskutti, Rebecca Willett

In a variety of settings, limitations of sensing technologies or other sampling mechanisms result in missing labels, where the likelihood of a missing label in the training set is an unknown function of the data. For example, satellites used to detect forest fires cannot sense fires below a certain size threshold. In such cases, training datasets consist of positive and pseudo-negative observations where pseudo-negative observations can be either true negatives or undetected positives with small magnitudes. We develop a new methodology and non-convex algorithm P(ositive) U(nlabeled) - O(ccurrence) M(agnitude) M(ixture) which jointly estimates the occurrence and detection likelihood of positive samples, utilizing prior knowledge of the detection mechanism. Our approach uses ideas from positive-unlabeled (PU)-learning and zero-inflated models that jointly estimate the magnitude and occurrence of events. We provide conditions under which our model is identifiable and prove that even though our approach leads to a non-convex objective, any local minimizer has optimal statistical error (up to a log term) and projected gradient descent has geometric convergence rates. We demonstrate on both synthetic data and a California wildfire dataset that our method out-performs existing state-of-the-art approaches.

STAug 6, 2020
A Sharp Blockwise Tensor Perturbation Bound for Orthogonal Iteration

Yuetian Luo, Garvesh Raskutti, Ming Yuan et al.

In this paper, we develop novel perturbation bounds for the high-order orthogonal iteration (HOOI) [DLDMV00b]. Under mild regularity conditions, we establish blockwise tensor perturbation bounds for HOOI with guarantees for both tensor reconstruction in Hilbert-Schmidt norm $\|\widehat{\bcT} - \bcT \|_{\tHS}$ and mode-$k$ singular subspace estimation in Schatten-$q$ norm $\| \sin Θ(\widehat{\U}_k, \U_k) \|_q$ for any $q \geq 1$. We show the upper bounds of mode-$k$ singular subspace estimation are unilateral and converge linearly to a quantity characterized by blockwise errors of the perturbation and signal strength. For the tensor reconstruction error bound, we express the bound through a simple quantity $ξ$, which depends only on perturbation and the multilinear rank of the underlying signal. Rate matching deterministic lower bound for tensor reconstruction, which demonstrates the optimality of HOOI, is also provided. Furthermore, we prove that one-step HOOI (i.e., HOOI with only a single iteration) is also optimal in terms of tensor reconstruction and can be used to lower the computational cost. The perturbation results are also extended to the case that only partial modes of $\bcT$ have low-rank structure. We support our theoretical results by extensive numerical studies. Finally, we apply the novel perturbation bounds of HOOI on two applications, tensor denoising and tensor co-clustering, from machine learning and statistics, which demonstrates the superiority of the new perturbation results.

MLMar 16, 2020
Context-dependent self-exciting point processes: models, methods, and risk bounds in high dimensions

Lili Zheng, Garvesh Raskutti, Rebecca Willett et al.

High-dimensional autoregressive point processes model how current events trigger or inhibit future events, such as activity by one member of a social network can affect the future activity of his or her neighbors. While past work has focused on estimating the underlying network structure based solely on the times at which events occur on each node of the network, this paper examines the more nuanced problem of estimating context-dependent networks that reflect how features associated with an event (such as the content of a social media post) modulate the strength of influences among nodes. Specifically, we leverage ideas from compositional time series and regularization methods in machine learning to conduct network estimation for high-dimensional marked point processes. Two models and corresponding estimators are considered in detail: an autoregressive multinomial model suited to categorical marks and a logistic-normal model suited to marks with mixed membership in different categories. Importantly, the logistic-normal model leads to a convex negative log-likelihood objective and captures dependence across categories. We provide theoretical guarantees for both estimators, which we validate by simulations and a synthetic data-generating model. We further validate our methods through two real data examples and demonstrate the advantages and disadvantages of both approaches.

MLNov 9, 2019
ISLET: Fast and Optimal Low-rank Tensor Regression via Importance Sketching

Anru Zhang, Yuetian Luo, Garvesh Raskutti et al.

In this paper, we develop a novel procedure for low-rank tensor regression, namely \emph{\underline{I}mportance \underline{S}ketching \underline{L}ow-rank \underline{E}stimation for \underline{T}ensors} (ISLET). The central idea behind ISLET is \emph{importance sketching}, i.e., carefully designed sketches based on both the responses and low-dimensional structure of the parameter of interest. We show that the proposed method is sharply minimax optimal in terms of the mean-squared error under low-rank Tucker assumptions and under randomized Gaussian ensemble design. In addition, if a tensor is low-rank with group sparsity, our procedure also achieves minimax optimality. Further, we show through numerical study that ISLET achieves comparable or better mean-squared error performance to existing state-of-the-art methods while having substantial storage and run-time advantages including capabilities for parallel and distributed computing. In particular, our procedure performs reliable estimation with tensors of dimension $p = O(10^8)$ and is $1$ or $2$ orders of magnitude faster than baseline methods.

MLJan 31, 2019
Minimizing Negative Transfer of Knowledge in Multivariate Gaussian Processes: A Scalable and Regularized Approach

Raed Kontar, Garvesh Raskutti, Shiyu Zhou

Recently there has been an increasing interest in the multivariate Gaussian process (MGP) which extends the Gaussian process (GP) to deal with multiple outputs. One approach to construct the MGP and account for non-trivial commonalities amongst outputs employs a convolution process (CP). The CP is based on the idea of sharing latent functions across several convolutions. Despite the elegance of the CP construction, it provides new challenges that need yet to be tackled. First, even with a moderate number of outputs, model building is extremely prohibitive due to the huge increase in computational demands and number of parameters to be estimated. Second, the negative transfer of knowledge may occur when some outputs do not share commonalities. In this paper we address these issues. We propose a regularized pairwise modeling approach for the MGP established using CP. The key feature of our approach is to distribute the estimation of the full multivariate model into a group of bivariate GPs which are individually built. Interestingly pairwise modeling turns out to possess unique characteristics, which allows us to tackle the challenge of negative transfer through penalizing the latent function that facilitates information sharing in each bivariate model. Predictions are then made through combining predictions from the bivariate models within a Bayesian framework. The proposed method has excellent scalability when the number of outputs is large and minimizes the negative transfer of knowledge between uncorrelated outputs. Statistical guarantees for the proposed method are studied and its advantageous features are demonstrated through numerical studies.

MLNov 7, 2018
Estimating Network Structure from Incomplete Event Data

Benjamin Mark, Garvesh Raskutti, Rebecca Willett

Multivariate Bernoulli autoregressive (BAR) processes model time series of events in which the likelihood of current events is determined by the times and locations of past events. These processes can be used to model nonlinear dynamical systems corresponding to criminal activity, responses of patients to different medical treatment plans, opinion dynamics across social networks, epidemic spread, and more. Past work examines this problem under the assumption that the event data is complete, but in many cases only a fraction of events are observed. Incomplete observations pose a significant challenge in this setting because the unobserved events still govern the underlying dynamical system. In this work, we develop a novel approach to estimating the parameters of a BAR process in the presence of unobserved events via an unbiased estimator of the complete data log-likelihood function. We propose a computationally efficient estimation algorithm which approximates this estimator via Taylor series truncation and establish theoretical results for both the statistical error and optimization error of our algorithm. We further justify our approach by testing our method on both simulated data and a real data set consisting of crimes recorded by the city of Chicago.

MLMar 20, 2018
Graph-based regularization for regression problems with alignment and highly-correlated designs

Yuan Li, Benjamin Mark, Garvesh Raskutti et al.

Sparse models for high-dimensional linear regression and machine learning have received substantial attention over the past two decades. Model selection, or determining which features or covariates are the best explanatory variables, is critical to the interpretability of a learned model. Much of the current literature assumes that covariates are only mildly correlated. However, in many modern applications covariates are highly correlated and do not exhibit key properties (such as the restricted eigenvalue condition, restricted isometry property, or other related assumptions). This work considers a high-dimensional regression setting in which a graph governs both correlations among the covariates and the similarity among regression coefficients -- meaning there is \emph{alignment} between the covariates and regression coefficients. Using side information about the strength of correlations among features, we form a graph with edge weights corresponding to pairwise covariances. This graph is used to define a graph total variation regularizer that promotes similar weights for correlated features. This work shows how the proposed graph-based regularization yields mean-squared error guarantees for a broad range of covariance graph structures. These guarantees are optimal for many specific covariance graphs, including block and lattice graphs. Our proposed approach outperforms other methods for highly-correlated design in a variety of experiments on synthetic data and real biochemistry data.

MLFeb 13, 2018
Network Estimation from Point Process Data

Benjamin Mark, Garvesh Raskutti, Rebecca Willett

Consider observing a collection of discrete events within a network that reflect how network nodes influence one another. Such data are common in spike trains recorded from biological neural networks, interactions within a social network, and a variety of other settings. Data of this form may be modeled as self-exciting point processes, in which the likelihood of future events depends on the past events. This paper addresses the problem of estimating self-excitation parameters and inferring the underlying functional network structure from self-exciting point process data. Past work in this area was limited by strong assumptions which are addressed by the novel approach here. Specifically, in this paper we (1) incorporate saturation in a point process model which both ensures stability and models non-linear thresholding effects; (2) impose general low-dimensional structural assumptions that include sparsity, group sparsity and low-rankness that allows bounds to be developed in the high-dimensional setting; and (3) incorporate long-range memory effects through moving average and higher-order auto-regressive components. Using our general framework, we provide a number of novel theoretical guarantees for high-dimensional self-exciting point processes that reflect the role played by the underlying network structure and long-term memory. We also provide simulations and real data examples to support our methodology and main results.

MLJan 23, 2018
Non-parametric Sparse Additive Auto-regressive Network Models

Hao Henry Zhou, Garvesh Raskutti

Consider a multi-variate time series $(X_t)_{t=0}^{T}$ where $X_t \in \mathbb{R}^d$ which may represent spike train responses for multiple neurons in a brain, crime event data across multiple regions, and many others. An important challenge associated with these time series models is to estimate an influence network between the $d$ variables, especially when the number of variables $d$ is large meaning we are in the high-dimensional setting. Prior work has focused on parametric vector auto-regressive models. However, parametric approaches are somewhat restrictive in practice. In this paper, we use the non-parametric sparse additive model (SpAM) framework to address this challenge. Using a combination of $β$ and $φ$-mixing properties of Markov chains and empirical process techniques for reproducing kernel Hilbert spaces (RKHSs), we provide upper bounds on mean-squared error in terms of the sparsity $s$, logarithm of the dimension $\log d$, number of time points $T$, and the smoothness of the RKHSs. Our rates are sharp up to logarithm factors in many cases. We also provide numerical experiments that support our theoretical results and display potential advantages of using our non-parametric SpAM framework for a Chicago crime dataset.

MLApr 28, 2017
Learning Quadratic Variance Function (QVF) DAG models via OverDispersion Scoring (ODS)

Gunwoong Park, Garvesh Raskutti

Learning DAG or Bayesian network models is an important problem in multi-variate causal inference. However, a number of challenges arises in learning large-scale DAG models including model identifiability and computational complexity since the space of directed graphs is huge. In this paper, we address these issues in a number of steps for a broad class of DAG models where the noise or variance is signal-dependent. Firstly we introduce a new class of identifiable DAG models, where each node has a distribution where the variance is a quadratic function of the mean (QVF DAG models). Our QVF DAG models include many interesting classes of distributions such as Poisson, Binomial, Geometric, Exponential, Gamma and many other distributions in which the noise variance depends on the mean. We prove that this class of QVF DAG models is identifiable, and introduce a new algorithm, the OverDispersion Scoring (ODS) algorithm, for learning large-scale QVF DAG models. Our algorithm is based on firstly learning the moralized or undirected graphical model representation of the DAG to reduce the DAG search-space, and then exploiting the quadratic variance property to learn the causal ordering. We show through theoretical results and simulations that our algorithm is statistically consistent in the high-dimensional p>n setting provided that the degree of the moralized graph is bounded and performs well compared to state-of-the-art DAG-learning algorithms.

MLNov 30, 2016
Non-Convex Projected Gradient Descent for Generalized Low-Rank Tensor Regression

Han Chen, Garvesh Raskutti, Ming Yuan

In this paper, we consider the problem of learning high-dimensional tensor regression problems with low-rank structure. One of the core challenges associated with learning high-dimensional models is computation since the underlying optimization problems are often non-convex. While convex relaxations could lead to polynomial-time algorithms they are often slow in practice. On the other hand, limited theoretical guarantees exist for non-convex methods. In this paper we provide a general framework that provides theoretical guarantees for learning high-dimensional tensor regression models under different low-rank structural assumptions using the projected gradient descent algorithm applied to a potentially non-convex constraint set $Θ$ in terms of its \emph{localized Gaussian width}. We juxtapose our theoretical results for non-convex projected gradient descent algorithms with previous results on regularized convex approaches. The two main differences between the convex and non-convex approach are: (i) from a computational perspective whether the non-convex projection operator is computable and whether the projection has desirable contraction properties and (ii) from a statistical upper bound perspective, the non-convex approach has a superior rate for a number of examples. We provide three concrete examples of low-dimensional structure which address these issues and explain the pros and cons for the non-convex and convex approaches. We supplement our theoretical results with simulations which show that, under several common settings of generalized low rank tensor regression, the projected gradient descent approach is superior both in terms of statistical error and run-time provided the step-sizes of the projected descent algorithm are suitably chosen.

MLMay 9, 2016
Inference of High-dimensional Autoregressive Generalized Linear Models

Eric C. Hall, Garvesh Raskutti, Rebecca Willett

Vector autoregressive models characterize a variety of time series in which linear combinations of current and past observations can be used to accurately predict future observations. For instance, each element of an observation vector could correspond to a different node in a network, and the parameters of an autoregressive model would correspond to the impact of the network structure on the time series evolution. Often these models are used successfully in practice to learn the structure of social, epidemiological, financial, or biological neural networks. However, little is known about statistical guarantees on estimates of such models in non-Gaussian settings. This paper addresses the inference of the autoregressive parameters and associated network structure within a generalized linear model framework that includes Poisson and Bernoulli autoregressive processes. At the heart of this analysis is a sparsity-regularized maximum likelihood estimator. While sparsity-regularization is well-studied in the statistics and machine learning communities, those analysis methods cannot be applied to autoregressive generalized linear models because of the correlations and potential heteroscedasticity inherent in the observations. Sample complexity bounds are derived using a combination of martingale concentration inequalities and modern empirical process techniques for dependent random variables. These bounds, which are supported by several simulation studies, characterize the impact of various network parameters on estimator performance.

MLFeb 14, 2016
Identifiability Assumptions and Algorithm for Directed Graphical Models with Feedback

Gunwoong Park, Garvesh Raskutti

Directed graphical models provide a useful framework for modeling causal or directional relationships for multivariate data. Prior work has largely focused on identifiability and search algorithms for directed acyclic graphical (DAG) models. In many applications, feedback naturally arises and directed graphical models that permit cycles occur. In this paper we address the issue of identifiability for general directed cyclic graphical (DCG) models satisfying the Markov assumption. In particular, in addition to the faithfulness assumption which has already been introduced for cyclic models, we introduce two new identifiability assumptions, one based on selecting the model with the fewest edges and the other based on selecting the DCG model that entails the maximum number of d-separation rules. We provide theoretical results comparing these assumptions which show that: (1) selecting models with the largest number of d-separation rules is strictly weaker than the faithfulness assumption; (2) unlike for DAG models, selecting models with the fewest edges does not necessarily result in a milder assumption than the faithfulness assumption. We also provide connections between our two new principles and minimality assumptions. We use our identifiability assumptions to develop search algorithms for small-scale DCG models. Our simulation study supports our theoretical results, showing that the algorithms based on our two new principles generally out-perform algorithms based on the faithfulness assumption in terms of selecting the true skeleton for DCG models.

MLMay 25, 2015
Statistical and Algorithmic Perspectives on Randomized Sketching for Ordinary Least-Squares -- ICML

Garvesh Raskutti, Michael Mahoney

We consider statistical and algorithmic aspects of solving large-scale least-squares (LS) problems using randomized sketching algorithms. Prior results show that, from an \emph{algorithmic perspective}, when using sketching matrices constructed from random projections and leverage-score sampling, if the number of samples $r$ much smaller than the original sample size $n$, then the worst-case (WC) error is the same as solving the original problem, up to a very small relative error. From a \emph{statistical perspective}, one typically considers the mean-squared error performance of randomized sketching algorithms, when data are generated according to a statistical linear model. In this paper, we provide a rigorous comparison of both perspectives leading to insights on how they differ. To do this, we first develop a framework for assessing, in a unified manner, algorithmic and statistical aspects of randomized sketching methods. We then consider the statistical prediction efficiency (PE) and the statistical residual efficiency (RE) of the sketched LS estimator; and we use our framework to provide upper bounds for several types of random projection and random sampling algorithms. Among other results, we show that the RE can be upper bounded when $r$ is much smaller than $n$, while the PE typically requires the number of samples $r$ to be substantially larger. Lower bounds developed in subsequent work show that our upper bounds on PE can not be improved.

MLJun 23, 2014
A Statistical Perspective on Randomized Sketching for Ordinary Least-Squares

Garvesh Raskutti, Michael Mahoney

We consider statistical as well as algorithmic aspects of solving large-scale least-squares (LS) problems using randomized sketching algorithms. For a LS problem with input data $(X, Y) \in \mathbb{R}^{n \times p} \times \mathbb{R}^n$, sketching algorithms use a sketching matrix, $S\in\mathbb{R}^{r \times n}$ with $r \ll n$. Then, rather than solving the LS problem using the full data $(X,Y)$, sketching algorithms solve the LS problem using only the sketched data $(SX, SY)$. Prior work has typically adopted an algorithmic perspective, in that it has made no statistical assumptions on the input $X$ and $Y$, and instead it has been assumed that the data $(X,Y)$ are fixed and worst-case (WC). Prior results show that, when using sketching matrices such as random projections and leverage-score sampling algorithms, with $p < r \ll n$, the WC error is the same as solving the original problem, up to a small constant. From a statistical perspective, we typically consider the mean-squared error performance of randomized sketching algorithms, when data $(X, Y)$ are generated according to a statistical model $Y = X β+ ε$, where $ε$ is a noise process. We provide a rigorous comparison of both perspectives leading to insights on how they differ. To do this, we first develop a framework for assessing algorithmic and statistical aspects of randomized sketching methods. We then consider the statistical prediction efficiency (PE) and the statistical residual efficiency (RE) of the sketched LS estimator; and we use our framework to provide upper bounds for several types of random projection and random sampling sketching algorithms. Among other results, we show that the RE can be upper bounded when $p < r \ll n$ while the PE typically requires the sample size $r$ to be substantially larger. Lower bounds developed in subsequent results show that our upper bounds on PE can not be improved.

MLOct 29, 2013
The Information Geometry of Mirror Descent

Garvesh Raskutti, Sayan Mukherjee

Information geometry applies concepts in differential geometry to probability and statistics and is especially useful for parameter estimation in exponential families where parameters are known to lie on a Riemannian manifold. Connections between the geometric properties of the induced manifold and statistical properties of the estimation problem are well-established. However developing first-order methods that scale to larger problems has been less of a focus in the information geometry community. The best known algorithm that incorporates manifold structure is the second-order natural gradient descent algorithm introduced by Amari. On the other hand, stochastic approximation methods have led to the development of first-order methods for optimizing noisy objective functions. A recent generalization of the Robbins-Monro algorithm known as mirror descent, developed by Nemirovski and Yudin is a first order method that induces non-Euclidean geometries. However current analysis of mirror descent does not precisely characterize the induced non-Euclidean geometry nor does it consider performance in terms of statistical relative efficiency. In this paper, we prove that mirror descent induced by Bregman divergences is equivalent to the natural gradient descent algorithm on the dual Riemannian manifold. Using this equivalence, it follows that (1) mirror descent is the steepest descent direction along the Riemannian manifold of the exponential family; (2) mirror descent with log-likelihood loss applied to parameter estimation in exponential families asymptotically achieves the classical Cramér-Rao lower bound and (3) natural gradient descent for manifolds corresponding to exponential families can be implemented as a first-order method through mirror descent.

STJul 1, 2013
Learning directed acyclic graphs based on sparsest permutations

Garvesh Raskutti, Caroline Uhler

We consider the problem of learning a Bayesian network or directed acyclic graph (DAG) model from observational data. A number of constraint-based, score-based and hybrid algorithms have been developed for this purpose. For constraint-based methods, statistical consistency guarantees typically rely on the faithfulness assumption, which has been show to be restrictive especially for graphs with cycles in the skeleton. However, there is only limited work on consistency guarantees for score-based and hybrid algorithms and it has been unclear whether consistency guarantees can be proven under weaker conditions than the faithfulness assumption. In this paper, we propose the sparsest permutation (SP) algorithm. This algorithm is based on finding the causal ordering of the variables that yields the sparsest DAG. We prove that this new score-based method is consistent under strictly weaker conditions than the faithfulness assumption. We also demonstrate through simulations on small DAGs that the SP algorithm compares favorably to the constraint-based PC and SGS algorithms as well as the score-based Greedy Equivalence Search and hybrid Max-Min Hill-Climbing method. In the Gaussian setting, we prove that our algorithm boils down to finding the permutation of the variables with sparsest Cholesky decomposition for the inverse covariance matrix. Using this connection, we show that in the oracle setting, where the true covariance matrix is known, the SP algorithm is in fact equivalent to $\ell_0$-penalized maximum likelihood estimation.

MLJun 15, 2013
Early stopping and non-parametric regression: An optimal data-dependent stopping rule

Garvesh Raskutti, Martin J. Wainwright, Bin Yu

The strategy of early stopping is a regularization technique based on choosing a stopping time for an iterative algorithm. Focusing on non-parametric regression in a reproducing kernel Hilbert space, we analyze the early stopping strategy for a form of gradient-descent applied to the least-squares loss function. We propose a data-dependent stopping rule that does not involve hold-out or cross-validation data, and we prove upper bounds on the squared error of the resulting function estimate, measured in either the $L^2(P)$ and $L^2(P_n)$ norm. These upper bounds lead to minimax-optimal rates for various kernel classes, including Sobolev smoothness classes and other forms of reproducing kernel Hilbert spaces. We show through simulation that our stopping rule compares favorably to two other stopping rules, one based on hold-out data and the other based on Stein's unbiased risk estimate. We also establish a tight connection between our early stopping strategy and the solution path of a kernel ridge regression estimator.