Devavrat Shah

LG
h-index63
64papers
1,821citations
Novelty52%
AI Score51

64 Papers

MLJun 2, 2023
Auditing for Human Expertise

Rohan Alur, Loren Laine, Darrick K. Li et al. · mit

High-stakes prediction tasks (e.g., patient diagnosis) are often handled by trained human experts. A common source of concern about automation in these settings is that experts may exercise intuition that is difficult to model and/or have access to information (e.g., conversations with a patient) that is simply unavailable to a would-be algorithm. This raises a natural question whether human experts add value which could not be captured by an algorithmic predictor. We develop a statistical framework under which we can pose this question as a natural hypothesis test. Indeed, as our framework highlights, detecting human expertise is more subtle than simply comparing the accuracy of expert predictions to those made by a particular learning algorithm. Instead, we propose a simple procedure which tests whether expert predictions are statistically independent from the outcomes of interest after conditioning on the available inputs (`features'). A rejection of our test thus suggests that human experts may add value to any algorithm trained on the available data, and has direct implications for whether human-AI `complementarity' is achievable in a given prediction task. We highlight the utility of our procedure using admissions data collected from the emergency department of a large academic hospital system, where we show that physicians' admit/discharge decisions for patients with acute gastrointestinal bleeding (AGIB) appear to be incorporating information that is not available to a standard algorithmic screening tool. This is despite the fact that the screening tool is arguably more accurate than physicians' discretionary decisions, highlighting that -- even absent normative concerns about accountability or interpretability -- accuracy is insufficient to justify algorithmic automation.

LGNov 14, 2022
On counterfactual inference with unobserved confounding

Abhin Shah, Raaz Dwivedi, Devavrat Shah et al. · harvard, mit

Given an observational study with $n$ independent but heterogeneous units, our goal is to learn the counterfactual distribution for each unit using only one $p$-dimensional sample per unit containing covariates, interventions, and outcomes. Specifically, we allow for unobserved confounding that introduces statistical biases between interventions and outcomes as well as exacerbates the heterogeneity across units. Modeling the conditional distribution of the outcomes as an exponential family, we reduce learning the unit-level counterfactual distributions to learning $n$ exponential family distributions with heterogeneous parameters and only one sample per distribution. We introduce a convex objective that pools all $n$ samples to jointly learn all $n$ parameter vectors, and provide a unit-wise mean squared error bound that scales linearly with the metric entropy of the parameter space. For example, when the parameters are $s$-sparse linear combination of $k$ known vectors, the error is $O(s\log k/p)$. En route, we derive sufficient conditions for compactly supported distributions to satisfy the logarithmic Sobolev inequality. As an application of the framework, our results enable consistent imputation of sparsely missing covariates.

MLNov 25, 2022
Doubly robust nearest neighbors in factor models

Raaz Dwivedi, Katherine Tian, Sabina Tomkins et al. · harvard, mit

We introduce and analyze an improved variant of nearest neighbors (NN) for estimation with missing data in latent factor models. We consider a matrix completion problem with missing data, where the $(i, t)$-th entry, when observed, is given by its mean $f(u_i, v_t)$ plus mean-zero noise for an unknown function $f$ and latent factors $u_i$ and $v_t$. Prior NN strategies, like unit-unit NN, for estimating the mean $f(u_i, v_t)$ relies on existence of other rows $j$ with $u_j \approx u_i$. Similarly, time-time NN strategy relies on existence of columns $t'$ with $v_{t'} \approx v_t$. These strategies provide poor performance respectively when similar rows or similar columns are not available. Our estimate is doubly robust to this deficit in two ways: (1) As long as there exist either good row or good column neighbors, our estimate provides a consistent estimate. (2) Furthermore, if both good row and good column neighbors exist, it provides a (near-)quadratic improvement in the non-asymptotic error and admits a significantly narrower asymptotic confidence interval when compared to both unit-unit or time-time NN.

MLFeb 4, 2023
Counterfactual Identifiability of Bijective Causal Models

Arash Nasr-Esfahany, Mohammad Alizadeh, Devavrat Shah

We study counterfactual identifiability in causal models with bijective generation mechanisms (BGM), a class that generalizes several widely-used causal models in the literature. We establish their counterfactual identifiability for three common causal structures with unobserved confounding, and propose a practical learning method that casts learning a BGM as structured generative modeling. Learned BGMs enable efficient counterfactual estimation and can be obtained using a variety of deep conditional generative models. We evaluate our techniques in a visual task and demonstrate its application in a real-world video streaming simulation task.

LGFeb 4, 2023
Matrix Estimation for Individual Fairness

Cindy Y. Zhang, Sarah H. Cen, Devavrat Shah

In recent years, multiple notions of algorithmic fairness have arisen. One such notion is individual fairness (IF), which requires that individuals who are similar receive similar treatment. In parallel, matrix estimation (ME) has emerged as a natural paradigm for handling noisy data with missing values. In this work, we connect the two concepts. We show that pre-processing data using ME can improve an algorithm's IF without sacrificing performance. Specifically, we show that using a popular ME method known as singular value thresholding (SVT) to pre-process the data provides a strong IF guarantee under appropriate conditions. We then show that, under analogous conditions, SVT pre-processing also yields estimates that are consistent and approximately minimax optimal. As such, the ME pre-processing step does not, under the stated conditions, increase the prediction error of the base algorithm, i.e., does not impose a fairness-performance trade-off. We verify these results on synthetic and real data.

LGJun 16, 2022
Gradient Descent for Low-Rank Functions

Romain Cosson, Ali Jadbabaie, Anuran Makur et al.

Several recent empirical studies demonstrate that important machine learning tasks, e.g., training deep neural networks, exhibit low-rank structure, where the loss function varies significantly in only a few directions of the input space. In this paper, we leverage such low-rank structure to reduce the high computational cost of canonical gradient-based methods such as gradient descent (GD). Our proposed \emph{Low-Rank Gradient Descent} (LRGD) algorithm finds an $ε$-approximate stationary point of a $p$-dimensional function by first identifying $r \leq p$ significant directions, and then estimating the true $p$-dimensional gradient at every iteration by computing directional derivatives only along those $r$ directions. We establish that the "directional oracle complexities" of LRGD for strongly convex and non-convex objective functions are $\mathcal{O}(r \log(1/ε) + rp)$ and $\mathcal{O}(r/ε^2 + rp)$, respectively. When $r \ll p$, these complexities are smaller than the known complexities of $\mathcal{O}(p \log(1/ε))$ and $\mathcal{O}(p/ε^2)$ of {\gd} in the strongly convex and non-convex settings, respectively. Thus, LRGD significantly reduces the computational cost of gradient-based methods for sufficiently low-rank functions. In the course of our analysis, we also formally define and characterize the classes of exact and approximately low-rank functions.

CYApr 20, 2023
A User-Driven Framework for Regulating and Auditing Social Media

Sarah H. Cen, Aleksander Madry, Devavrat Shah

People form judgments and make decisions based on the information that they observe. A growing portion of that information is not only provided, but carefully curated by social media platforms. Although lawmakers largely agree that platforms should not operate without any oversight, there is little consensus on how to regulate social media. There is consensus, however, that creating a strict, global standard of "acceptable" content is untenable (e.g., in the US, it is incompatible with Section 230 of the Communications Decency Act and the First Amendment). In this work, we propose that algorithmic filtering should be regulated with respect to a flexible, user-driven baseline. We provide a concrete framework for regulating and auditing a social media platform according to such a baseline. In particular, we introduce the notion of a baseline feed: the content that a user would see without filtering (e.g., on Twitter, this could be the chronological timeline). We require that the feeds a platform filters contain "similar" informational content as their respective baseline feeds, and we design a principled way to measure similarity. This approach is motivated by related suggestions that regulations should increase user agency. We present an auditing procedure that checks whether a platform honors this requirement. Notably, the audit needs only black-box access to a platform's filtering algorithm, and it does not access or infer private user information. We provide theoretical guarantees on the strength of the audit. We further show that requiring closeness between filtered and baseline feeds does not impose a large performance cost, nor does it create echo chambers.

LGNov 4, 2023
Estimating Ground Reaction Forces from Inertial Sensors

Bowen Song, Marco Paolieri, Harper E. Stewart et al.

Objective: Our aim is to determine if data collected with inertial measurement units (IMUs) during steady-state running could be used to estimate ground reaction forces (GRFs) and to derive biomechanical variables (e.g., contact time, impulse, change in velocity) using lightweight machine-learning approaches. In contrast, state-of-the-art estimation using LSTMs suffers from prohibitive inference times on edge devices, requires expensive training and hyperparameter optimization, and results in black box models. Methods: We proposed a novel lightweight solution, SVD Embedding Regression (SER), using linear regression between SVD embeddings of IMU data and GRF data. We also compared lightweight solutions including SER and k-Nearest-Neighbors (KNN) regression with state-of-the-art LSTMs. Results: We performed extensive experiments to evaluate these techniques under multiple scenarios and combinations of IMU signals and quantified estimation errors for predicting GRFs and biomechanical variables. We did this using training data from different athletes, from the same athlete, or both, and we explored the use of acceleration and angular velocity data from sensors at different locations (sacrum and shanks). Conclusion: Our results illustrated that lightweight solutions such as SER and KNN can be similarly accurate or more accurate than LSTMs. The use of personal data reduced estimation errors of all methods, particularly for most biomechanical variables (as compared to GRFs); moreover, this gain was more pronounced in the lightweight methods. Significance: The study of GRFs is used to characterize the mechanical loading experienced by individuals in movements such as running, which is clinically applicable to identify athletes at risk for stress-related injuries.

LGSep 12, 2023
On Computationally Efficient Learning of Exponential Family Distributions

Abhin Shah, Devavrat Shah, Gregory W. Wornell

We consider the classical problem of learning, with arbitrary accuracy, the natural parameters of a $k$-parameter truncated \textit{minimal} exponential family from i.i.d. samples in a computationally and statistically efficient manner. We focus on the setting where the support as well as the natural parameters are appropriately bounded. While the traditional maximum likelihood estimator for this class of exponential family is consistent, asymptotically normal, and asymptotically efficient, evaluating it is computationally hard. In this work, we propose a novel loss function and a computationally efficient estimator that is consistent as well as asymptotically normal under mild conditions. We show that, at the population level, our method can be viewed as the maximum likelihood estimation of a re-parameterized distribution belonging to the same class of exponential family. Further, we show that our estimator can be interpreted as a solution to minimizing a particular Bregman score as well as an instance of minimizing the \textit{surrogate} likelihood. We also provide finite sample guarantees to achieve an error (in $\ell_2$-norm) of $α$ in the parameter estimation with sample complexity $O({\sf poly}(k)/α^2)$. Our method achives the order-optimal sample complexity of $O({\sf log}(k)/α^2)$ when tailored for node-wise-sparse Markov random fields. Finally, we demonstrate the performance of our estimator via numerical experiments.

LGJun 7, 2023
Exploiting Observation Bias to Improve Matrix Completion

Yassir Jedra, Sean Mann, Charlotte Park et al.

We consider a variant of matrix completion where entries are revealed in a biased manner. We wish to understand the extent to which such bias can be exploited in improving predictions. Towards that, we propose a natural model where the observation pattern and outcome of interest are driven by the same set of underlying latent (or unobserved) factors. We devise Mask Nearest Neighbor (MNN), a novel two-stage matrix completion algorithm: first, it recovers (distances between) the latent factors by utilizing matrix estimation for the fully observed noisy binary matrix, corresponding to the observation pattern; second, it utilizes the recovered latent factors as features and sparsely observed noisy outcomes as labels to perform non-parametric supervised learning. Our analysis reveals that MNN enjoys entry-wise finite-sample error rates that are competitive with corresponding supervised learning parametric rates. Despite not having access to the latent factors and dealing with biased observations, MNN exhibits such competitive performance via only exploiting the shared information between the bias and outcomes. Finally, through empirical evaluation using a real-world dataset, we find that with MNN, the estimates have 28x smaller mean squared error compared to traditional matrix completion methods, suggesting the utility of the model and method proposed in this work.

EMOct 20, 2022
Network Synthetic Interventions: A Causal Framework for Panel Data Under Network Interference

Anish Agarwal, Sarah H. Cen, Devavrat Shah et al.

We propose a generalization of the synthetic controls and synthetic interventions methodology to incorporate network interference. We consider the estimation of unit-specific potential outcomes from panel data in the presence of spillover across units and unobserved confounding. Key to our approach is a novel latent factor model that takes into account network interference and generalizes the factor models typically used in panel data settings. We propose an estimator, Network Synthetic Interventions (NSI), and show that it consistently estimates the mean outcomes for a unit under an arbitrary set of counterfactual treatments for the network. We further establish that the estimator is asymptotically normal. We furnish two validity tests for whether the NSI estimator reliably generalizes to produce accurate counterfactual estimates. We provide a novel graph-based experiment design that guarantees the NSI estimator produces accurate counterfactual estimates, and also analyze the sample complexity of the proposed design. We conclude with simulations that corroborate our theoretical findings.

LGFeb 21, 2025
Explaining the Success of Nearest Neighbor Methods in Prediction

George H. Chen, Devavrat Shah

Many modern methods for prediction leverage nearest neighbor search to find past training examples most similar to a test example, an idea that dates back in text to at least the 11th century and has stood the test of time. This monograph aims to explain the success of these methods, both in theory, for which we cover foundational nonasymptotic statistical guarantees on nearest-neighbor-based regression and classification, and in practice, for which we gather prominent methods for approximate nearest neighbor search that have been essential to scaling prediction systems reliant on nearest neighbor analysis to handle massive datasets. Furthermore, we discuss connections to learning distances for use with nearest neighbor methods, including how random decision trees and ensemble methods learn nearest neighbor structure, as well as recent developments in crowdsourcing and graphons. In terms of theory, our focus is on nonasymptotic statistical guarantees, which we state in the form of how many training data and what algorithm parameters ensure that a nearest neighbor prediction method achieves a user-specified error tolerance. We begin with the most general of such results for nearest neighbor and related kernel regression and classification in general metric spaces. In such settings in which we assume very little structure, what enables successful prediction is smoothness in the function being estimated for regression, and a low probability of landing near the decision boundary for classification. In practice, these conditions could be difficult to verify for a real dataset. We then cover recent guarantees on nearest neighbor prediction in the three case studies of time series forecasting, recommending products to people over time, and delineating human organs in medical images by looking at image patches. In these case studies, clustering structure enables successful prediction.

MLSep 22, 2024
Exploiting Exogenous Structure for Sample-Efficient Reinforcement Learning

Jia Wan, Sean R. Sinclair, Devavrat Shah et al.

We study Exo-MDPs, a structured class of Markov Decision Processes (MDPs) where the state space is partitioned into exogenous and endogenous components. Exogenous states evolve stochastically, independent of the agent's actions, while endogenous states evolve deterministically based on both state components and actions. Exo-MDPs are useful for applications including inventory control, portfolio management, and ride-sharing. Our first result is structural, establishing a representational equivalence between the classes of discrete MDPs, Exo-MDPs, and discrete linear mixture MDPs. Specifically, any discrete MDP can be represented as an Exo-MDP, and the transition and reward dynamics can be written as linear functions of the exogenous state distribution, showing that Exo-MDPs are instances of linear mixture MDPs. For unobserved exogenous states, we prove a regret upper bound of $O(H^{3/2}d\sqrt{K})$ over $K$ trajectories of horizon $H$, with $d$ as the size of the exogenous state space, and establish nearly-matching lower bounds. Our findings demonstrate how Exo-MDPs decouple sample complexity from action and endogenous state sizes, and we validate our theoretical insights with experiments on inventory control.

IRMay 7
OBLIQ-Bench: Exposing Overlooked Bottlenecks in Modern Retrievers with Latent and Implicit Queries

Diane Tchuindjo, Devavrat Shah, Omar Khattab

Retrieval benchmarks are increasingly saturating, but we argue that efficient search is far from a solved problem. We identify a class of queries we call oblique, which seek documents that instantiate a latent pattern, like finding all tweets that express an implicit stance, chat logs that demonstrate a particular failure mode, or transcripts that match an abstract scenario. We study three mechanisms through which obliqueness may arise and introduce OBLIQ-Bench, a suite of five oblique search problems over real long-tail corpora. OBLIQ-Bench exposes an overlooked asymmetry between retrieval and verification, where reasoning LLMs reliably recognize latent relevance whenever relevant documents are surfaced, but even sophisticated retrieval pipelines fail to surface most relevant documents in the first place. We hope that OBLIQ-Bench will drive research into retrieval architectures that efficiently capture latent patterns and implicit signals in large corpora.

LGFeb 1, 2024
Human Expertise in Algorithmic Prediction

Rohan Alur, Manish Raghavan, Devavrat Shah

We introduce a novel framework for incorporating human expertise into algorithmic predictions. Our approach leverages human judgment to distinguish inputs which are algorithmically indistinguishable, or "look the same" to predictive algorithms. We argue that this framing clarifies the problem of human-AI collaboration in prediction tasks, as experts often form judgments by drawing on information which is not encoded in an algorithm's training data. Algorithmic indistinguishability yields a natural test for assessing whether experts incorporate this kind of "side information", and further provides a simple but principled method for selectively incorporating human feedback into algorithmic predictions. We show that this method provably improves the performance of any feasible algorithmic predictor and precisely quantify this improvement. We find empirically that although algorithms often outperform their human counterparts on average, human judgment can improve algorithmic predictions on specific instances (which can be identified ex-ante). In an X-ray classification task, we find that this subset constitutes nearly $30\%$ of the patient population. Our approach provides a natural way of uncovering this heterogeneity and thus enabling effective human-AI collaboration.

LGOct 11, 2024
Integrating Expert Judgment and Algorithmic Decision Making: An Indistinguishability Framework

Rohan Alur, Loren Laine, Darrick K. Li et al.

We introduce a novel framework for human-AI collaboration in prediction and decision tasks. Our approach leverages human judgment to distinguish inputs which are algorithmically indistinguishable, or "look the same" to any feasible predictive algorithm. We argue that this framing clarifies the problem of human-AI collaboration in prediction and decision tasks, as experts often form judgments by drawing on information which is not encoded in an algorithm's training data. Algorithmic indistinguishability yields a natural test for assessing whether experts incorporate this kind of "side information", and further provides a simple but principled method for selectively incorporating human feedback into algorithmic predictions. We show that this method provably improves the performance of any feasible algorithmic predictor and precisely quantify this improvement. We demonstrate the utility of our framework in a case study of emergency room triage decisions, where we find that although algorithmic risk scores are highly competitive with physicians, there is strong evidence that physician judgments provide signal which could not be replicated by any predictive algorithm. This insight yields a range of natural decision rules which leverage the complementary strengths of human experts and predictive algorithms.

EMApr 2, 2025
A Causal Inference Framework for Data Rich Environments

Alberto Abadie, Anish Agarwal, Devavrat Shah

We propose a formal model for counterfactual estimation with unobserved confounding in "data-rich" settings, i.e., where there are a large number of units and a large number of measurements per unit. Our model provides a bridge between the structural causal model view of causal inference common in the graphical models literature with that of the latent factor model view common in the potential outcomes literature. We show how classic models for potential outcomes and treatment assignments fit within our framework. We provide an identification argument for the average treatment effect, the average treatment effect on the treated, and the average treatment effect on the untreated. For any estimator that has a fast enough estimation error rate for a certain nuisance parameter, we establish it is consistent for these various causal parameters. We then show principal component regression is one such estimator that leads to consistent estimation, and we analyze the minimal smoothness required of the potential outcomes function for consistency.

LGNov 18, 2025
Synthetic Survival Control: Extending Synthetic Controls for "When-If" Decision

Jessy Xinyi Han, Devavrat Shah

Estimating causal effects on time-to-event outcomes from observational data is particularly challenging due to censoring, limited sample sizes, and non-random treatment assignment. The need for answering such "when-if" questions--how the timing of an event would change under a specified intervention--commonly arises in real-world settings with heterogeneous treatment adoption and confounding. To address these challenges, we propose Synthetic Survival Control (SSC) to estimate counterfactual hazard trajectories in a panel data setting where multiple units experience potentially different treatments over multiple periods. In such a setting, SSC estimates the counterfactual hazard trajectory for a unit of interest as a weighted combination of the observed trajectories from other units. To provide formal justification, we introduce a panel framework with a low-rank structure for causal survival analysis. Indeed, such a structure naturally arises under classical parametric survival models. Within this framework, for the causal estimand of interest, we establish identification and finite sample guarantees for SSC. We validate our approach using a multi-country clinical dataset of cancer treatment outcomes, where the staggered introduction of new therapies creates a quasi-experimental setting. Empirically, we find that access to novel treatments is associated with improved survival, as reflected by lower post-intervention hazard trajectories relative to their synthetic counterparts. Given the broad relevance of survival analysis across medicine, economics, and public policy, our framework offers a general and interpretable tool for counterfactual survival inference using observational data.

LGSep 28, 2025
The Impossibility of Inverse Permutation Learning in Transformer Models

Rohan Alur, Chris Hays, Manish Raghavan et al.

In this technical note, we study the problem of inverse permutation learning in decoder-only transformers. Given a permutation and a string to which that permutation has been applied, the model is tasked with producing the original (``canonical'') string. We argue that this task models a natural robustness property across a variety of reasoning tasks, including long-context retrieval, multiple choice QA and in-context learning. Our primary contribution is an impossibility result: we show that an arbitrary depth, decoder-only transformer cannot learn this task. This result concerns the expressive capacity of decoder-only transformer models and is agnostic to training dynamics or sample complexity. We give a pair of alternative constructions under which inverse permutation learning is feasible. The first of these highlights the fundamental role of the causal attention mask, and reveals a gap between the expressivity of encoder-decoder transformers and the more popular decoder-only architecture. The latter result is more surprising: we show that simply padding the input with ``scratch tokens" yields a construction under which inverse permutation learning is possible. We conjecture that this may suggest an alternative mechanism by which chain-of-thought prompting or, more generally, intermediate ``thinking'' tokens can enable reasoning in large language models, even when these tokens encode no meaningful semantic information (e.g., the results of intermediate computations).

LGFeb 1, 2025
$k$-SVD with Gradient Descent

Yassir Jedra, Devavrat Shah

The emergence of modern compute infrastructure for iterative optimization has led to great interest in developing optimization-based approaches for a scalable computation of $k$-SVD, i.e., the $k\geq 1$ largest singular values and corresponding vectors of a matrix of rank $d \geq 1$. Despite lots of exciting recent works, all prior works fall short in this pursuit. Specifically, the existing results are either for the exact-parameterized (i.e., $k = d$) and over-parameterized (i.e., $k > d$) settings; or only establish local convergence guarantees; or use a step-size that requires problem-instance-specific oracle-provided information. In this work, we complete this pursuit by providing a gradient-descent method with a simple, universal rule for step-size selection (akin to pre-conditioning), that provably finds $k$-SVD for a matrix of any rank $d \geq 1$. We establish that the gradient method with random initialization enjoys global linear convergence for any $k, d \geq 1$. Our convergence analysis reveals that the gradient method has an attractive region, and within this attractive region, the method behaves like Heron's method (a.k.a. the Babylonian method). Our analytic results about the said attractive region imply that the gradient method can be enhanced by means of Nesterov's momentum-based acceleration technique. The resulting improved convergence rates match those of rather complicated methods typically relying on Lanczos iterations or variants thereof. We provide an empirical study to validate the theoretical results.

LGMay 25, 2023
SAMoSSA: Multivariate Singular Spectrum Analysis with Stochastic Autoregressive Noise

Abdullah Alomar, Munther Dahleh, Sean Mann et al.

The well-established practice of time series analysis involves estimating deterministic, non-stationary trend and seasonality components followed by learning the residual stochastic, stationary components. Recently, it has been shown that one can learn the deterministic non-stationary components accurately using multivariate Singular Spectrum Analysis (mSSA) in the absence of a correlated stationary component; meanwhile, in the absence of deterministic non-stationary components, the Autoregressive (AR) stationary component can also be learnt readily, e.g. via Ordinary Least Squares (OLS). However, a theoretical underpinning of multi-stage learning algorithms involving both deterministic and stationary components has been absent in the literature despite its pervasiveness. We resolve this open question by establishing desirable theoretical guarantees for a natural two-stage algorithm, where mSSA is first applied to estimate the non-stationary components despite the presence of a correlated stationary AR component, which is subsequently learned from the residual time series. We provide a finite-sample forecasting consistency bound for the proposed algorithm, SAMoSSA, which is data-driven and thus requires minimal parameter tuning. To establish theoretical guarantees, we overcome three hurdles: (i) we characterize the spectra of Page matrices of stable AR processes, thus extending the analysis of mSSA; (ii) we extend the analysis of AR process identification in the presence of arbitrary bounded perturbations; (iii) we characterize the out-of-sample or forecasting error, as opposed to solely considering model identification. Through representative empirical studies, we validate the superior performance of SAMoSSA compared to existing baselines. Notably, SAMoSSA's ability to account for AR noise structure yields improvements ranging from 5% to 37% across various benchmark datasets.

MLFeb 14, 2022
Counterfactual inference in sequential experiments

Raaz Dwivedi, Katherine Tian, Sabina Tomkins et al.

We consider after-study statistical inference for sequentially designed experiments wherein multiple units are assigned treatments for multiple time points using treatment policies that adapt over time. Our goal is to provide inference guarantees for the counterfactual mean at the smallest possible scale -- mean outcome under different treatments for each unit and each time -- with minimal assumptions on the adaptive treatment policy. Without any structural assumptions on the counterfactual means, this challenging task is infeasible due to more unknowns than observed data points. To make progress, we introduce a latent factor model over the counterfactual means that serves as a non-parametric generalization of the non-linear mixed effects model and the bilinear latent factor model considered in prior works. For estimation, we use a non-parametric method, namely a variant of nearest neighbors, and establish a non-asymptotic high probability error bound for the counterfactual mean for each unit and each time. Under regularity conditions, this bound leads to asymptotically valid confidence intervals for the counterfactual mean as the number of units and time points grows to $\infty$ together at suitable rates. We illustrate our theory via several simulations and a case study involving data from a mobile health clinical trial HeartSteps.

MLJan 7, 2022
Unifying Epidemic Models with Mixtures

Arnab Sarker, Ali Jadbabaie, Devavrat Shah

The COVID-19 pandemic has emphasized the need for a robust understanding of epidemic models. Current models of epidemics are classified as either mechanistic or non-mechanistic: mechanistic models make explicit assumptions on the dynamics of disease, whereas non-mechanistic models make assumptions on the form of observed time series. Here, we introduce a simple mixture-based model which bridges the two approaches while retaining benefits of both. The model represents time series of cases and fatalities as a mixture of Gaussian curves, providing a flexible function class to learn from data compared to traditional mechanistic models. Although the model is non-mechanistic, we show that it arises as the natural outcome of a stochastic process based on a networked SIR framework. This allows learned parameters to take on a more meaningful interpretation compared to similar non-mechanistic models, and we validate the interpretations using auxiliary mobility data collected during the COVID-19 pandemic. We provide a simple learning algorithm to identify model parameters and establish theoretical results which show the model can be efficiently learned from data. Empirically, we find the model to have low prediction error. The model is available live at covidpredictions.mit.edu. Ultimately, this allows us to systematically understand the impacts of interventions on COVID-19, which is critical in developing data-driven solutions to controlling epidemics.

LGJan 6, 2022
Federated Optimization of Smooth Loss Functions

Ali Jadbabaie, Anuran Makur, Devavrat Shah

In this work, we study empirical risk minimization (ERM) within a federated learning framework, where a central server minimizes an ERM objective function using training data that is stored across $m$ clients. In this setting, the Federated Averaging (FedAve) algorithm is the staple for determining $ε$-approximate solutions to the ERM problem. Similar to standard optimization algorithms, the convergence analysis of FedAve only relies on smoothness of the loss function in the optimization parameter. However, loss functions are often very smooth in the training data too. To exploit this additional smoothness, we propose the Federated Low Rank Gradient Descent (FedLRGD) algorithm. Since smoothness in data induces an approximate low rank structure on the loss function, our method first performs a few rounds of communication between the server and clients to learn weights that the server can use to approximate clients' gradients. Then, our method solves the ERM problem at the server using inexact gradient descent. To show that FedLRGD can have superior performance to FedAve, we present a notion of federated oracle complexity as a counterpart to canonical oracle complexity. Under some assumptions on the loss function, e.g., strong convexity in parameter, $η$-Hölder smoothness in data, etc., we prove that the federated oracle complexity of FedLRGD scales like $φm(p/ε)^{Θ(d/η)}$ and that of FedAve scales like $φm(p/ε)^{3/4}$ (neglecting sub-dominant factors), where $φ\gg 1$ is a "communication-to-computation ratio," $p$ is the parameter dimension, and $d$ is the data dimension. Then, we show that when $d$ is small and the loss function is sufficiently smooth in the data, FedLRGD beats FedAve in federated oracle complexity. Finally, in the course of analyzing FedLRGD, we also establish a result on low rank approximation of latent variable models.

LGJan 5, 2022
CausalSim: A Causal Framework for Unbiased Trace-Driven Simulation

Abdullah Alomar, Pouya Hamadanian, Arash Nasr-Esfahany et al.

We present CausalSim, a causal framework for unbiased trace-driven simulation. Current trace-driven simulators assume that the interventions being simulated (e.g., a new algorithm) would not affect the validity of the traces. However, real-world traces are often biased by the choices algorithms make during trace collection, and hence replaying traces under an intervention may lead to incorrect results. CausalSim addresses this challenge by learning a causal model of the system dynamics and latent factors capturing the underlying system conditions during trace collection. It learns these models using an initial randomized control trial (RCT) under a fixed set of algorithms, and then applies them to remove biases from trace data when simulating new algorithms. Key to CausalSim is mapping unbiased trace-driven simulation to a tensor completion problem with extremely sparse observations. By exploiting a basic distributional invariance property present in RCT data, CausalSim enables a novel tensor completion method despite the sparsity of observations. Our extensive evaluation of CausalSim on both real and synthetic datasets, including more than ten months of real data from the Puffer video streaming system shows it improves simulation accuracy, reducing errors by 53% and 61% on average compared to expert-designed and supervised learning baselines. Moreover, CausalSim provides markedly different insights about ABR algorithms compared to the biased baseline simulator, which we validate with a real deployment.

STDec 29, 2021
Time varying regression with hidden linear dynamics

Ali Jadbabaie, Horia Mania, Devavrat Shah et al.

We revisit a model for time-varying linear regression that assumes the unknown parameters evolve according to a linear dynamical system. Counterintuitively, we show that when the underlying dynamics are stable the parameters of this model can be estimated from data by combining just two ordinary least squares estimates. We offer a finite sample guarantee on the estimation error of our method and discuss certain advantages it has over Expectation-Maximization (EM), which is the main approach proposed by prior work.

LGOct 28, 2021
A Computationally Efficient Method for Learning Exponential Family Distributions

Abhin Shah, Devavrat Shah, Gregory W. Wornell

We consider the question of learning the natural parameters of a $k$ parameter minimal exponential family from i.i.d. samples in a computationally and statistically efficient manner. We focus on the setting where the support as well as the natural parameters are appropriately bounded. While the traditional maximum likelihood estimator for this class of exponential family is consistent, asymptotically normal, and asymptotically efficient, evaluating it is computationally hard. In this work, we propose a computationally efficient estimator that is consistent as well as asymptotically normal under mild conditions. We provide finite sample guarantees to achieve an ($\ell_2$) error of $α$ in the parameter estimation with sample complexity $O(\mathrm{poly}(k/α))$ and computational complexity ${O}(\mathrm{poly}(k/α))$. To establish these results, we show that, at the population level, our method can be viewed as the maximum likelihood estimation of a re-parameterized distribution belonging to the same class of exponential family.

EMSep 30, 2021
Causal Matrix Completion

Anish Agarwal, Munther Dahleh, Devavrat Shah et al.

Matrix completion is the study of recovering an underlying matrix from a sparse subset of noisy observations. Traditionally, it is assumed that the entries of the matrix are "missing completely at random" (MCAR), i.e., each entry is revealed at random, independent of everything else, with uniform probability. This is likely unrealistic due to the presence of "latent confounders", i.e., unobserved factors that determine both the entries of the underlying matrix and the missingness pattern in the observed matrix. For example, in the context of movie recommender systems -- a canonical application for matrix completion -- a user who vehemently dislikes horror films is unlikely to ever watch horror films. In general, these confounders yield "missing not at random" (MNAR) data, which can severely impact any inference procedure that does not correct for this bias. We develop a formal causal model for matrix completion through the language of potential outcomes, and provide novel identification arguments for a variety of causal estimands of interest. We design a procedure, which we call "synthetic nearest neighbors" (SNN), to estimate these causal estimands. We prove finite-sample consistency and asymptotic normality of our estimator. Our analysis also leads to new theoretical results for the matrix completion literature. In particular, we establish entry-wise, i.e., max-norm, finite-sample consistency and asymptotic normality results for matrix completion with MNAR data. As a special case, this also provides entry-wise bounds for matrix completion with MCAR data. Across simulated and real data, we demonstrate the efficacy of our proposed estimator.

DSFeb 19, 2021
Quantifying Variational Approximation for the Log-Partition Function

Romain Cosson, Devavrat Shah

Variational approximation, such as mean-field (MF) and tree-reweighted (TRW), provide a computationally efficient approximation of the log-partition function for a generic graphical model. TRW provably provides an upper bound, but the approximation ratio is generally not quantified. As the primary contribution of this work, we provide an approach to quantify the approximation ratio through the property of the underlying graph structure. Specifically, we argue that (a variant of) TRW produces an estimate that is within factor $\frac{1}{\sqrt{κ(G)}}$ of the true log-partition function for any discrete pairwise graphical model over graph $G$, where $κ(G) \in (0,1]$ captures how far $G$ is from tree structure with $κ(G) = 1$ for trees and $2/N$ for the complete graph over $N$ vertices. As a consequence, the approximation ratio is $1$ for trees, $\sqrt{(d+1)/2}$ for any graph with maximum average degree $d$, and $\stackrel{β\to\infty}{\approx} 1+1/(2β)$ for graphs with girth (shortest cycle) at least $β\log N$. In general, $κ(G)$ is the solution of a max-min problem associated with $G$ that can be evaluated in polynomial time for any graph. Using samples from the uniform distribution over the spanning trees of G, we provide a near linear-time variant that achieves an approximation ratio equal to the inverse of square-root of minimal (across edges) effective resistance of the graph. We connect our results to the graph partition-based approximation method and thus provide a unified perspective. Keywords: variational inference, log-partition function, spanning tree polytope, minimum effective resistance, min-max spanning tree, local inference

LGFeb 13, 2021
PerSim: Data-Efficient Offline Reinforcement Learning with Heterogeneous Agents via Personalized Simulators

Anish Agarwal, Abdullah Alomar, Varkey Alumootil et al.

We consider offline reinforcement learning (RL) with heterogeneous agents under severe data scarcity, i.e., we only observe a single historical trajectory for every agent under an unknown, potentially sub-optimal policy. We find that the performance of state-of-the-art offline and model-based RL methods degrade significantly given such limited data availability, even for commonly perceived "solved" benchmark settings such as "MountainCar" and "CartPole". To address this challenge, we propose PerSim, a model-based offline RL approach which first learns a personalized simulator for each agent by collectively using the historical trajectories across all agents, prior to learning a policy. We do so by positing that the transition dynamics across agents can be represented as a latent function of latent factors associated with agents, states, and actions; subsequently, we theoretically establish that this function is well-approximated by a "low-rank" decomposition of separable agent, state, and action latent functions. This representation suggests a simple, regularized neural network architecture to effectively learn the transition dynamics per agent, even with scarce, offline data. We perform extensive experiments across several benchmark environments and RL methods. The consistent improvement of our approach, measured in terms of both state dynamics prediction and eventual reward, confirms the efficacy of our framework in leveraging limited historical data to simultaneously learn personalized policies across agents.

LGFeb 11, 2021
Regret, stability & fairness in matching markets with bandit learners

Sarah H. Cen, Devavrat Shah

Making an informed decision -- for example, when choosing a career or housing -- requires knowledge about the available options. Such knowledge is generally acquired through costly trial and error, but this learning process can be disrupted by competition. In this work, we study how competition affects the long-term outcomes of individuals as they learn. We build on a line of work that models this setting as a two-sided matching market with bandit learners. A recent result in this area states that it is impossible to simultaneously guarantee two natural desiderata: stability and low optimal regret for all agents. Resource-allocating platforms can point to this result as a justification for assigning good long-term outcomes to some agents and poor ones to others. We show that this impossibility need not hold true. In particular, by modeling two additional components of competition -- namely, costs and transfers -- we prove that it is possible to simultaneously guarantee four desiderata: stability, low optimal regret, fairness in the distribution of regret, and high social welfare.

LGNov 4, 2020
Gradient-Based Empirical Risk Minimization using Local Polynomial Regression

Ali Jadbabaie, Anuran Makur, Devavrat Shah

In this paper, we consider the problem of empirical risk minimization (ERM) of smooth, strongly convex loss functions using iterative gradient-based methods. A major goal of this literature has been to compare different algorithms, such as gradient descent (GD) or stochastic gradient descent (SGD), by analyzing their rates of convergence to $ε$-approximate solutions. For example, the oracle complexity of GD is $O(n\log(ε^{-1}))$, where $n$ is the number of training samples. When $n$ is large, this can be expensive in practice, and SGD is preferred due to its oracle complexity of $O(ε^{-1})$. Such standard analyses only utilize the smoothness of the loss function in the parameter being optimized. In contrast, we demonstrate that when the loss function is smooth in the data, we can learn the oracle at every iteration and beat the oracle complexities of both GD and SGD in important regimes. Specifically, at every iteration, our proposed algorithm performs local polynomial regression to learn the gradient of the loss function, and then estimates the true gradient of the ERM objective function. We establish that the oracle complexity of our algorithm scales like $\tilde{O}((p ε^{-1})^{d/(2η)})$ (neglecting sub-dominant factors), where $d$ and $p$ are the data and parameter space dimensions, respectively, and the gradient of the loss function belongs to a $η$-Hölder class with respect to the data. Our proof extends the analysis of local polynomial regression in non-parametric statistics to provide interpolation guarantees in multivariate settings, and also exploits tools from the inexact GD literature. Unlike GD and SGD, the complexity of our method depends on $d$ and $p$. However, when $d$ is small and the loss function exhibits modest smoothness in the data, our algorithm beats GD and SGD in oracle complexity for a very broad range of $p$ and $ε$.

LGOct 28, 2020
On Learning Continuous Pairwise Markov Random Fields

Abhin Shah, Devavrat Shah, Gregory W. Wornell

We consider learning a sparse pairwise Markov Random Field (MRF) with continuous-valued variables from i.i.d samples. We adapt the algorithm of Vuffray et al. (2019) to this setting and provide finite-sample analysis revealing sample complexity scaling logarithmically with the number of variables, as in the discrete and Gaussian settings. Our approach is applicable to a large class of pairwise MRFs with continuous variables and also has desirable asymptotic properties, including consistency and normality under mild conditions. Further, we establish that the population version of the optimization criterion employed in Vuffray et al. (2019) can be interpreted as local maximum likelihood estimation (MLE). As part of our analysis, we introduce a robust variation of sparse linear regression a` la Lasso, which may be of interest in its own right.

STOct 27, 2020
On Model Identification and Out-of-Sample Prediction of Principal Component Regression: Applications to Synthetic Controls

Anish Agarwal, Devavrat Shah, Dennis Shen

We analyze principal component regression (PCR) in a high-dimensional error-in-variables setting with fixed design. Under suitable conditions, we show that PCR consistently identifies the unique model with minimum $\ell_2$-norm. These results enable us to establish non-asymptotic out-of-sample prediction guarantees that improve upon the best known rates. In the course of our analysis, we introduce a natural linear algebraic condition between the in- and out-of-sample covariates, which allows us to avoid distributional assumptions for out-of-sample predictions. Our simulations illustrate the importance of this condition for generalization, even under covariate shifts. Accordingly, we construct a hypothesis test to check when this conditions holds in practice. As a byproduct, our results also lead to novel results for the synthetic controls literature, a leading approach for policy evaluation. To the best of our knowledge, our prediction guarantees for the fixed design setting have been elusive in both the high-dimensional error-in-variables and synthetic controls literatures.

LGJun 24, 2020
On Multivariate Singular Spectrum Analysis and its Variants

Anish Agarwal, Abdullah Alomar, Devavrat Shah

We introduce and analyze a variant of multivariate singular spectrum analysis (mSSA), a popular time series method to impute and forecast a multivariate time series. Under a spatio-temporal factor model we introduce, given $N$ time series and $T$ observations per time series, we establish prediction mean-squared-error for both imputation and out-of-sample forecasting effectively scale as $1 / \sqrt{\min(N, T )T}$. This is an improvement over: (i) $1 /\sqrt{T}$ error scaling of SSA, the restriction of mSSA to a univariate time series; (ii) $1/\min(N, T)$ error scaling for matrix estimation methods which do not exploit temporal structure in the data. The spatio-temporal model we introduce includes any finite sum and products of: harmonics, polynomials, differentiable periodic functions, and Holder continuous functions. Our out-of-sample forecasting result could be of independent interest for online learning under a spatio-temporal factor model. Empirically, on benchmark datasets, our variant of mSSA performs competitively with state-of-the-art neural-network time series methods (e.g. DeepAR, LSTM) and significantly outperforms classical methods such as vector autoregression (VAR). Finally, we propose extensions of mSSA: (i) a variant to estimate time-varying variance of a time series; (ii) a tensor variant which has better sample complexity for certain regimes of $N$ and $T$.

MLJun 15, 2020
Estimation of Skill Distributions

Ali Jadbabaie, Anuran Makur, Devavrat Shah

In this paper, we study the problem of learning the skill distribution of a population of agents from observations of pairwise games in a tournament. These games are played among randomly drawn agents from the population. The agents in our model can be individuals, sports teams, or Wall Street fund managers. Formally, we postulate that the likelihoods of game outcomes are governed by the Bradley-Terry-Luce (or multinomial logit) model, where the probability of an agent beating another is the ratio between its skill level and the pairwise sum of skill levels, and the skill parameters are drawn from an unknown skill density of interest. The problem is, in essence, to learn a distribution from noisy, quantized observations. We propose a simple and tractable algorithm that learns the skill density with near-optimal minimax mean squared error scaling as $n^{-1+\varepsilon}$, for any $\varepsilon>0$, when the density is smooth. Our approach brings together prior work on learning skill parameters from pairwise comparisons with kernel density estimation from non-parametric statistics. Furthermore, we prove minimax lower bounds which establish minimax optimality of the skill parameter estimation technique used in our algorithm. These bounds utilize a continuum version of Fano's method along with a covering argument. We apply our algorithm to various soccer leagues and world cups, cricket world cups, and mutual funds. We find that the entropy of a learnt distribution provides a quantitative measure of skill, which provides rigorous explanations for popular beliefs about perceived qualities of sporting events, e.g., soccer league rankings. Finally, we apply our method to assess the skill distributions of mutual funds. Our results shed light on the abundance of low quality funds prior to the Great Recession of 2008, and the domination of the industry by more skilled funds after the financial crisis.

EMJun 13, 2020
Synthetic Interventions

Anish Agarwal, Devavrat Shah, Dennis Shen

The synthetic controls (SC) methodology is a prominent tool for policy evaluation in panel data applications. Researchers commonly justify the SC framework with a low-rank matrix factor model that assumes the potential outcomes are described by low-dimensional unit and time specific latent factors. In the recent work of [Abadie '20], one of the pioneering authors of the SC method posed the question of how the SC framework can be extended to multiple treatments. This article offers one resolution to this open question that we call synthetic interventions (SI). Fundamental to the SI framework is a low-rank tensor factor model, which extends the matrix factor model by including a latent factorization over treatments. Under this model, we propose a generalization of the standard SC-based estimators. We prove the consistency for one instantiation of our approach and provide conditions under which it is asymptotically normal. Moreover, we conduct a representative simulation to study its prediction performance and revisit the canonical SC case study of [Abadie-Diamond-Hainmueller '10] on the impact of anti-tobacco legislations by exploring related questions not previously investigated.

LGJun 11, 2020
Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation

Devavrat Shah, Dogyoon Song, Zhi Xu et al.

We consider the question of learning $Q$-function in a sample efficient manner for reinforcement learning with continuous state and action spaces under a generative model. If $Q$-function is Lipschitz continuous, then the minimal sample complexity for estimating $ε$-optimal $Q$-function is known to scale as $Ω(\frac{1}{ε^{d_1+d_2 +2}})$ per classical non-parametric learning theory, where $d_1$ and $d_2$ denote the dimensions of the state and action spaces respectively. The $Q$-function, when viewed as a kernel, induces a Hilbert-Schmidt operator and hence possesses square-summable spectrum. This motivates us to consider a parametric class of $Q$-functions parameterized by its "rank" $r$, which contains all Lipschitz $Q$-functions as $r \to \infty$. As our key contribution, we develop a simple, iterative learning algorithm that finds $ε$-optimal $Q$-function with sample complexity of $\widetilde{O}(\frac{1}{ε^{\max(d_1, d_2)+2}})$ when the optimal $Q$-function has low rank $r$ and the discounting factor $γ$ is below a certain threshold. Thus, this provides an exponential improvement in sample complexity. To enable our result, we develop a novel Matrix Estimation algorithm that faithfully estimates an unknown low-rank matrix in the $\ell_\infty$ sense even in the presence of arbitrary bounded noise, which might be of interest in its own right. Empirical results on several stochastic control tasks confirm the efficacy of our "low-rank" algorithms.

LGJun 8, 2020
Stable Reinforcement Learning with Unbounded State Space

Devavrat Shah, Qiaomin Xie, Zhi Xu

We consider the problem of reinforcement learning (RL) with unbounded state space motivated by the classical problem of scheduling in a queueing network. Traditional policies as well as error metric that are designed for finite, bounded or compact state space, require infinite samples for providing any meaningful performance guarantee (e.g. $\ell_\infty$ error) for unbounded state space. That is, we need a new notion of performance metric. As the main contribution of this work, inspired by the literature in queuing systems and control theory, we propose stability as the notion of "goodness": the state dynamics under the policy should remain in a bounded region with high probability. As a proof of concept, we propose an RL policy using Sparse-Sampling-based Monte Carlo Oracle and argue that it satisfies the stability property as long as the system dynamics under the optimal policy respects a Lyapunov function. The assumption of existence of a Lyapunov function is not restrictive as it is equivalent to the positive recurrence or stability property of any Markov chain, i.e., if there is any policy that can stabilize the system then it must possess a Lyapunov function. And, our policy does not utilize the knowledge of the specific Lyapunov function. To make our method sample efficient, we provide an improved, sample efficient Sparse-Sampling-based Monte Carlo Oracle with Lipschitz value function that may be of interest in its own right. Furthermore, we design an adaptive version of the algorithm, based on carefully constructed statistical tests, which finds the correct tuning parameter automatically.

EMApr 30, 2020
Two Burning Questions on COVID-19: Did shutting down the economy help? Can we (partially) reopen the economy without risking the second wave?

Anish Agarwal, Abdullah Alomar, Arnab Sarker et al.

As we reach the apex of the COVID-19 pandemic, the most pressing question facing us is: can we even partially reopen the economy without risking a second wave? We first need to understand if shutting down the economy helped. And if it did, is it possible to achieve similar gains in the war against the pandemic while partially opening up the economy? To do so, it is critical to understand the effects of the various interventions that can be put into place and their corresponding health and economic implications. Since many interventions exist, the key challenge facing policy makers is understanding the potential trade-offs between them, and choosing the particular set of interventions that works best for their circumstance. In this memo, we provide an overview of Synthetic Interventions (a natural generalization of Synthetic Control), a data-driven and statistically principled method to perform what-if scenario planning, i.e., for policy makers to understand the trade-offs between different interventions before having to actually enact them. In essence, the method leverages information from different interventions that have already been enacted across the world and fits it to a policy maker's setting of interest, e.g., to estimate the effect of mobility-restricting interventions on the U.S., we use daily death data from countries that enforced severe mobility restrictions to create a "synthetic low mobility U.S." and predict the counterfactual trajectory of the U.S. if it had indeed applied a similar intervention. Using Synthetic Interventions, we find that lifting severe mobility restrictions and only retaining moderate mobility restrictions (at retail and transit locations), seems to effectively flatten the curve. We hope this provides guidance on weighing the trade-offs between the safety of the population, strain on the healthcare system, and impact on the economy.

LGFeb 25, 2020
On Reinforcement Learning for Turn-based Zero-sum Markov Games

Devavrat Shah, Varun Somani, Qiaomin Xie et al.

We consider the problem of finding Nash equilibrium for two-player turn-based zero-sum games. Inspired by the AlphaGo Zero (AGZ) algorithm, we develop a Reinforcement Learning based approach. Specifically, we propose Explore-Improve-Supervise (EIS) method that combines "exploration", "policy improvement"' and "supervised learning" to find the value function and policy associated with Nash equilibrium. We identify sufficient conditions for convergence and correctness for such an approach. For a concrete instance of EIS where random policy is used for "exploration", Monte-Carlo Tree Search is used for "policy improvement" and Nearest Neighbors is used for "supervised learning", we establish that this method finds an $\varepsilon$-approximate value function of Nash equilibrium in $\widetilde{O}(\varepsilon^{-(d+4)})$ steps when the underlying state-space of the game is continuous and $d$-dimensional. This is nearly optimal as we establish a lower bound of $\widetildeΩ(\varepsilon^{-(d+2)})$ for any policy.

NENov 1, 2019
Short and Wide Network Paths

Lavanya Marla, Lav R. Varshney, Devavrat Shah et al.

Network flow is a powerful mathematical framework to systematically explore the relationship between structure and function in biological, social, and technological networks. We introduce a new pipelining model of flow through networks where commodities must be transported over single paths rather than split over several paths and recombined. We show this notion of pipelined network flow is optimized using network paths that are both short and wide, and develop efficient algorithms to compute such paths for given pairs of nodes and for all-pairs. Short and wide paths are characterized for many real-world networks. To further demonstrate the utility of this network characterization, we develop novel information-theoretic lower bounds on computation speed in nervous systems due to limitations from anatomical connectivity and physical noise. For the nematode Caenorhabditis elegans, we find these bounds are predictive of biological timescales of behavior. Further, we find the particular C. elegans connectome is globally less efficient for information flow than random networks, but the hub-and-spoke architecture of functional subcircuits is optimal under constraint on number of synapses. This suggests functional subcircuits are a primary organizational principle of this small invertebrate nervous system.

LGAug 3, 2019
Robust Max Entrywise Error Bounds for Tensor Estimation from Sparse Observations via Similarity Based Collaborative Filtering

Devavrat Shah, Christina Lee Yu

Consider the task of estimating a 3-order $n \times n \times n$ tensor from noisy observations of randomly chosen entries in the sparse regime. We introduce a similarity based collaborative filtering algorithm for estimating a tensor from sparse observations and argue that it achieves sample complexity that nearly matches the conjectured computationally efficient lower bound on the sample complexity for the setting of low-rank tensors. Our algorithm uses the matrix obtained from the flattened tensor to compute similarity, and estimates the tensor entries using a nearest neighbor estimator. We prove that the algorithm recovers a finite rank tensor with maximum entry-wise error (MEE) and mean-squared-error (MSE) decaying to $0$ as long as each entry is observed independently with probability $p = Ω(n^{-3/2 + κ})$ for any arbitrarily small $κ> 0$. More generally, we establish robustness of the estimator, showing that when arbitrary noise bounded by $\varepsilon \geq 0$ is added to each observation, the estimation error with respect to MEE and MSE degrades by $\text{poly}(\varepsilon)$. Consequently, even if the tensor may not have finite rank but can be approximated within $\varepsilon \geq 0$ by a finite rank tensor, then the estimation error converges to $\text{poly}(\varepsilon)$. Our analysis sheds insight into the conjectured sample complexity lower bound, showing that it matches the connectivity threshold of the graph used by our algorithm for estimating similarity between coordinates.

DBMar 17, 2019
tspDB: Time Series Predict DB

Anish Agarwal, Abdullah Alomar, Devavrat Shah

A major bottleneck of the current Machine Learning (ML) workflow is the time consuming, error prone engineering required to get data from a datastore or a database (DB) to the point an ML algorithm can be applied to it. Hence, we explore the feasibility of directly integrating prediction functionality on top of a data store or DB. Such a system ideally: (i) provides an intuitive prediction query interface which alleviates the unwieldy data engineering; (ii) provides state-of-the-art statistical accuracy while ensuring incremental model update, low model training time and low latency for making predictions. As the main contribution we explicitly instantiate a proof-of-concept, tspDB, which directly integrates with PostgreSQL. We rigorously test tspDB's statistical and computational performance against the state-of-the-art time series algorithms, including a Long-Short-Term-Memory (LSTM) neural network and DeepAR (industry standard deep learning library by Amazon). Statistically, on standard time series benchmarks, tspDB outperforms LSTM and DeepAR with 1.1-1.3x higher relative accuracy. Computationally, tspDB is 59-62x and 94-95x faster compared to LSTM and DeepAR in terms of median ML model training time and prediction query latency, respectively. Further, compared to PostgreSQL's bulk insert time and its SELECT query latency, tspDB is slower only by 1.3x and 2.6x respectively. That is, tspDB is a real-time prediction system in that its model training / prediction query time is similar to just inserting / reading data from a DB. As an algorithmic contribution, we introduce an incremental multivariate matrix factorization based time series method, which tspDB is built off. We show this method also allows one to produce reliable prediction intervals by accurately estimating the time-varying variance of a time series, thereby addressing an important problem in time series analysis.

LGFeb 28, 2019
On Robustness of Principal Component Regression

Anish Agarwal, Devavrat Shah, Dennis Shen et al.

Principal component regression (PCR) is a simple, but powerful and ubiquitously utilized method. Its effectiveness is well established when the covariates exhibit low-rank structure. However, its ability to handle settings with noisy, missing, and mixed-valued, i.e., discrete and continuous, covariates is not understood and remains an important open challenge. As the main contribution of this work we establish the robustness of PCR, without any change, in this respect and provide meaningful finite-sample analysis. To do so, we establish that PCR is equivalent to performing linear regression after pre-processing the covariate matrix via hard singular value thresholding (HSVT). As a result, in the context of counterfactual analysis using observational data, we show PCR is equivalent to the recently proposed robust variant of the synthetic control method, known as robust synthetic control (RSC). As an immediate consequence, we obtain finite-sample analysis of the RSC estimator that was previously absent. As an important contribution to the synthetic controls literature, we establish that an (approximate) linear synthetic control exists in the setting of a generalized factor model or latent variable model; traditionally in the literature, the existence of a synthetic control needs to be assumed to exist as an axiom. We further discuss a surprising implication of the robustness property of PCR with respect to noise, i.e., PCR can learn a good predictive model even if the covariates are tactfully transformed to preserve differential privacy. Finally, this work advances the state-of-the-art analysis for HSVT by establishing stronger guarantees with respect to the $\ell_{2, \infty}$-norm rather than the Frobenius norm as is commonly done in the matrix estimation literature, which may be of interest in its own right.

MLFeb 14, 2019
Non-Asymptotic Analysis of Monte Carlo Tree Search

Devavrat Shah, Qiaomin Xie, Zhi Xu

In this work, we consider the popular tree-based search strategy within the framework of reinforcement learning, the Monte Carlo Tree Search (MCTS), in the context of infinite-horizon discounted cost Markov Decision Process (MDP). While MCTS is believed to provide an approximate value function for a given state with enough simulations, the claimed proof in the seminal works is incomplete. This is due to the fact that the variant, the Upper Confidence Bound for Trees (UCT), analyzed in prior works utilizes "logarithmic" bonus term for balancing exploration and exploitation within the tree-based search, following the insights from stochastic multi-arm bandit (MAB) literature. In effect, such an approach assumes that the regret of the underlying recursively dependent non-stationary MABs concentrates around their mean exponentially in the number of steps, which is unlikely to hold as pointed out in literature, even for stationary MABs. As the key contribution of this work, we establish polynomial concentration property of regret for a class of non-stationary MAB. This in turn establishes that the MCTS with appropriate polynomial rather than logarithmic bonus term in UCB has the claimed property. Using this as a building block, we argue that MCTS, combined with nearest neighbor supervised learning, acts as a "policy improvement" operator: it iteratively improves value function approximation for all states, due to combining with supervised learning, despite evaluating at only finitely many states. In effect, we establish that to learn an $\varepsilon$ approximation of the value function with respect to $\ell_\infty$ norm, MCTS combined with nearest neighbor requires a sample size scaling as $\widetilde{O}\big(\varepsilon^{-(d+4)}\big)$, where $d$ is the dimension of the state space. This is nearly optimal due to a minimax lower bound of $\widetildeΩ\big(\varepsilon^{-(d+2)}\big)$.

MLDec 31, 2018
Learning RUMs: Reducing Mixture to Single Component via PCA

Devavrat Shah, Dogyoon Song

We consider the problem of learning a mixture of Random Utility Models (RUMs). Despite the success of RUMs in various domains and the versatility of mixture RUMs to capture the heterogeneity in preferences, there has been only limited progress in learning a mixture of RUMs from partial data such as pairwise comparisons. In contrast, there have been significant advances in terms of learning a single component RUM using pairwise comparisons. In this paper, we aim to bridge this gap between mixture learning and single component learning of RUM by developing a `reduction' procedure. We propose to utilize PCA-based spectral clustering that simultaneously `de-noises' pairwise comparison data. We prove that our algorithm manages to cluster the partial data correctly (i.e., comparisons from the same RUM component are grouped in the same cluster) with high probability even when data is generated from a possibly {\em heterogeneous} mixture of well-separated {\em generic} RUMs. Both the time and the sample complexities scale polynomially in model parameters including the number of items. Two key features in the analysis are in establishing (1) a meaningful upper bound on the sub-Gaussian norm for RUM components embedded into the vector space of pairwise marginals and (2) the robustness of PCA with missing values in the $L_{2, \infty}$ sense, which might be of interest in their own right.

IROct 15, 2018
Regret vs. Bandwidth Trade-off for Recommendation Systems

Linqi Song, Christina Fragouli, Devavrat Shah

We consider recommendation systems that need to operate under wireless bandwidth constraints, measured as number of broadcast transmissions, and demonstrate a (tight for some instances) tradeoff between regret and bandwidth for two scenarios: the case of multi-armed bandit with context, and the case where there is a latent structure in the message space that we can exploit to reduce the learning phase.

LGFeb 25, 2018
Model Agnostic Time Series Analysis via Matrix Estimation

Anish Agarwal, Muhammad Jehangir Amjad, Devavrat Shah et al.

We propose an algorithm to impute and forecast a time series by transforming the observed time series into a matrix, utilizing matrix estimation to recover missing values and de-noise observed entries, and performing linear regression to make predictions. At the core of our analysis is a representation result, which states that for a large model class, the transformed time series matrix is (approximately) low-rank. In effect, this generalizes the widely used Singular Spectrum Analysis (SSA) in time series literature, and allows us to establish a rigorous link between time series analysis and matrix estimation. The key to establishing this link is constructing a Page matrix with non-overlapping entries rather than a Hankel matrix as is commonly done in the literature (e.g., SSA). This particular matrix structure allows us to provide finite sample analysis for imputation and prediction, and prove the asymptotic consistency of our method. Another salient feature of our algorithm is that it is model agnostic with respect to both the underlying time dynamics and the noise distribution in the observations. The noise agnostic property of our approach allows us to recover the latent states when only given access to noisy and partial observations a la a Hidden Markov Model; e.g., recovering the time-varying parameter of a Poisson process without knowing that the underlying process is Poisson. Furthermore, since our forecasting algorithm requires regression with noisy features, our approach suggests a matrix estimation based method - coupled with a novel, non-standard matrix estimation error metric - to solve the error-in-variable regression problem, which could be of interest in its own right. Through synthetic and real-world datasets, we demonstrate that our algorithm outperforms standard software packages (including R libraries) in the presence of missing data as well as high levels of noise.

LGFeb 12, 2018
Q-learning with Nearest Neighbors

Devavrat Shah, Qiaomin Xie

We consider model-free reinforcement learning for infinite-horizon discounted Markov Decision Processes (MDPs) with a continuous state space and unknown transition kernel, when only a single sample path under an arbitrary policy of the system is available. We consider the Nearest Neighbor Q-Learning (NNQL) algorithm to learn the optimal Q function using nearest neighbor regression method. As the main contribution, we provide tight finite sample analysis of the convergence rate. In particular, for MDPs with a $d$-dimensional state space and the discounted factor $γ\in (0,1)$, given an arbitrary sample path with "covering time" $ L $, we establish that the algorithm is guaranteed to output an $\varepsilon$-accurate estimate of the optimal Q-function using $\tilde{O}\big(L/(\varepsilon^3(1-γ)^7)\big)$ samples. For instance, for a well-behaved MDP, the covering time of the sample path under the purely random policy scales as $ \tilde{O}\big(1/\varepsilon^d\big),$ so the sample complexity scales as $\tilde{O}\big(1/\varepsilon^{d+3}\big).$ Indeed, we establish a lower bound that argues that the dependence of $ \tildeΩ\big(1/\varepsilon^{d+2}\big)$ is necessary.