LG Oct 19, 2023
Data Augmentations for Improved (Large) Language Model Generalization
Amir Feder, Yoav Wald, Claudia Shi et al.
The reliance of text classifiers on spurious correlations can lead to poor generalization at deployment, raising concerns about their use in safety-critical domains such as healthcare. In this work, we propose to use counterfactual data augmentation, guided by knowledge of the causal structure of the data, to simulate interventions on spurious features and to learn more robust text classifiers. We show that this strategy is appropriate in prediction problems where the label is spuriously correlated with an attribute. Under the assumptions of such problems, we discuss the favorable sample complexity of counterfactual data augmentation, compared to importance re-weighting. Pragmatically, we match examples using auxiliary data, based on diff-in-diff methodology, and use a large language model (LLM) to represent a conditional probability of text. Through extensive experimentation on learning caregiver-invariant predictors of clinical diagnoses from medical narratives and on semi-synthetic data, we demonstrate that our method for simulating interventions improves out-of-distribution (OOD) accuracy compared to baseline invariant learning algorithms.
LG Jul 21, 2022
JAWS: Auditing Predictive Uncertainty Under Covariate Shift
Drew Prinster, Anqi Liu, Suchi Saria
We propose JAWS, a series of wrapper methods for distribution-free uncertainty quantification tasks under covariate shift, centered on the core method JAW, the JAckknife+ Weighted with data-dependent likelihood-ratio weights. JAWS also includes computationally efficient Approximations of JAW using higher-order influence functions: JAWA. Theoretically, we show that JAW relaxes the jackknife+'s assumption of data exchangeability to achieve the same finite-sample coverage guarantee even under covariate shift. JAWA further approaches the JAW guarantee in the limit of the sample size or the influence function order under common regularity assumptions. Moreover, we propose a general approach to repurposing predictive interval-generating methods and their guarantees to the reverse task: estimating the probability that a prediction is erroneous, based on user-specified error criteria such as a safe or acceptable tolerance threshold around the true label. We then propose JAW-E and JAWA-E as repurposed versions of the proposed methods for this Error assessment task. Practically, JAWS outperform state-of-the-art predictive inference baselines in a variety of biased real-world data sets for interval-generation and error-assessment predictive uncertainty auditing tasks.
AI Mar 2
Conformal Policy Control
Drew Prinster, Clara Fannjiang, Ji Won Park et al.
An agent must try new behaviors to explore and improve. In high-stakes environments, an agent that violates safety constraints may cause harm and must be taken offline, curtailing any future interaction. Imitating old behavior is safe, but excessive conservatism discourages exploration. How much behavior change is too much? We show how to use any safe reference policy as a probabilistic regulator for any optimized but untested policy. Conformal calibration on data from the safe policy determines how aggressively the new policy can act, while provably enforcing the user's declared risk tolerance. Unlike conservative optimization methods, we do not assume the user has identified the correct model class nor tuned any hyperparameters. Unlike previous conformal methods, our theory provides finite-sample guarantees even for non-monotonic bounded constraint functions. Our experiments on applications ranging from natural language question answering to biomolecular engineering show that safe exploration is not only possible from the first moment of deployment, but can also improve performance.
LG Dec 1, 2025
Open-Set Domain Adaptation Under Background Distribution Shift: Challenges and A Provably Efficient Solution
Shravan Chaudhari, Yoav Wald, Suchi Saria
As we deploy machine learning systems in the real world, a core challenge is to maintain a model that is performant even as the data shifts. Such shifts can take many forms: new classes may emerge that were absent during training, a problem known as open-set recognition, and the distribution of known categories may change. Guarantees on open-set recognition are mostly derived under the assumption that the distribution of known classes, which we call the background distribution, is fixed. In this paper we develop a method that is guaranteed to solve open-set recognition even in the challenging case where the background distribution shifts. We prove that the method works under benign assumptions that the novel class is separable from the non-novel classes, and provide theoretical guarantees that it outperforms a representative baseline in a simplified overparameterized setting. We develop techniques to make our method scalable and robust, and perform comprehensive empirical evaluations on image and text data. The results show that our method significantly outperforms existing open-set recognition methods under background shift. Moreover, we provide new insights into how factors such as the size of the novel class influence performance, an aspect that has not been extensively explored in prior work.
85.0 LG May 13
MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling
Hsing-Huan Chung, Shijun Li, Yoav Wald et al.
Multimodal irregular time series (MITS) consist of asynchronous and irregularly sampled observations from heterogeneous numerical and textual channels. In healthcare, for example, patients' electronic health records (EHR) include irregular lab measurements and clinical notes. The irregular timing and channel patterns of observations carry predictive signal alongside the numerical values and textual content. LLMs are natural candidates for processing such heterogeneous data, given their extensive pretrained knowledge spanning textual and numerical domains. We introduce MILM (Multimodal Irregular time series Language Model), which represents MITS as time-ordered triplets in Extensible Markup Language (XML) format and fine-tunes an LLM through a two-stage strategy for MITS classification. The first stage trains on value-redacted MITS to predict from sampling patterns alone, and the second stage trains on full MITS to jointly model sampling patterns and observed values. Our two-stage model (MILM-2S) and its single-stage counterpart (MILM-Direct) achieve the best and second-best average performance on multiple EHR datasets. Further value redaction evaluations confirm that sampling patterns carry predictive signal and that MILM-2S learns to exploit them. In the value pending evaluation we introduce, where some values are unavailable at prediction time, MILM-2S outperforms MILM-Direct by a larger margin compared to standard evaluation. For MILM-2S, preserving the time and channel of value-pending observations as additional sampling information further improves in-hospital mortality prediction.
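The serialization idea described in this abstract (time-ordered triplets in XML, with an optional value-redacted variant) can be sketched roughly as follows. This is only an illustration under stated assumptions: the tag name `obs`, the attributes `t` and `channel`, the `[REDACTED]` placeholder, and the function `serialize_mits` are all hypothetical and are not taken from the paper.

```python
import xml.etree.ElementTree as ET

def serialize_mits(observations, redact_values=False):
    """Serialize (time, channel, value) triplets as time-ordered XML.

    Tag names, attribute names, and the redaction token are hypothetical,
    chosen only for illustration.
    """
    root = ET.Element("series")
    # Sort by timestamp so the output is time-ordered regardless of input order.
    for time, channel, value in sorted(observations, key=lambda o: o[0]):
        node = ET.SubElement(root, "obs", {"t": str(time), "channel": channel})
        # With redaction, only the sampling pattern (when, which channel) remains.
        node.text = "[REDACTED]" if redact_values else str(value)
    return ET.tostring(root, encoding="unicode")

# Toy record mixing numerical and textual channels (made-up values).
record = [
    (5.5, "note", "patient reports chest pain"),
    (0.0, "heart_rate", 88),
    (2.0, "lactate", 1.9),
]
full_xml = serialize_mits(record)
redacted_xml = serialize_mits(record, redact_values=True)
```

The redacted string keeps every timestamp and channel name but none of the observed values, which mirrors the abstract's notion of predicting from sampling patterns alone.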
71.0 LG May 10
FLAME: Adaptive Mixture-of-Experts for Continual Multimodal Multi-Task Learning
Xing Han, Shravan Chaudhari, Tanvi Ranade et al.
Real-world model deployment across multiple domains requires multimodal models to operate under two complementary regimes: (1) multi-task pretraining, in which tasks are co-available at design time and related tasks can borrow representational strength from one another, and (2) continual adaptation, in which new tasks emerge after deployment with previously unseen modality combinations. However, neither regime alone suffices: the pretraining task set is never exhaustive, while bypassing joint training forfeits the transfer gains and efficiency among co-trainable tasks. Sparse Mixture-of-Experts (MoE) is a natural fit for this dual requirement: sparse activation enables modular capacity expansion as new tasks arrive, while routing decouples modality-level computation from task-level composition. In this work, we propose a scalable MoE framework for multitask pretraining and continual learning across flexible modality combinations. The framework is designed to support training on multimodal tasks with diverse modality configurations by leveraging modality-specific routers that process tokens from each modality across tasks. Furthermore, it enables continual learning over sequential multimodal tasks within a fixed-capacity MoE by compressing accumulated expert knowledge into low-rank memory subspaces, while expanding only the lightweight routers. We validate the effectiveness of our method on multiple healthcare multimodal benchmarks. It demonstrates competitive multitask pretraining performance while alleviating catastrophic forgetting and improving parameter efficiency.
81.7 LG May 8
On the Invariance and Generality of Neural Scaling Laws
Xing Han, Ziyin Liu, Suchi Saria et al.
Neural scaling laws establish a predictable relationship between model performance and data or compute, offering crucial guidance for resource allocation in new domains and tasks. Yet such laws are most needed precisely where they are hardest to obtain: fitting one for a new model-task pair demands expensive sweeps that typically exhaust the very compute budget the law is meant to economize. This paper poses the research question of how to develop generalizable scaling laws: laws fit once on a well-resourced source domain and reliably transported to new domains where running a full sweep is infeasible, which requires a fundamental understanding of when and why scaling properties change. We address this by identifying the right invariants: scaling laws are preserved under bijective (information-preserving) transformations of the data and modified in predictable, information-theoretically grounded ways under non-bijective transformations that lower its information resolution $ρ$, a single axis along which a law fit in one domain can be transported to another. We validate this across language, vision, and speech, and demonstrate two cross-domain applications: predicting scaling for language models trained on electronic health records from laws fit on general text, and predicting time-series classification scaling under varying levels of noise injection, recovering the data-scaling exponents to within $3\%$ error.
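To make "recovering a data-scaling exponent" concrete, here is a minimal sketch that fits an exponent by least squares in log-log space. The pure power-law form loss = a * n^(-b), with no irreducible-loss term, is an assumption made for illustration; the abstract does not state the functional form the authors fit, and the data below are synthetic.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size). Returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Ordinary least-squares slope and intercept of y against x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic, noise-free data generated from an assumed law (a = 5.0, b = 0.3).
sizes = [1e3, 1e4, 1e5, 1e6]
true_a, true_b = 5.0, 0.3
losses = [true_a * n ** (-true_b) for n in sizes]

a_hat, b_hat = fit_power_law(sizes, losses)
relative_error = abs(b_hat - true_b) / true_b
```

On real sweeps the data are noisy and the relative error of the fitted exponent is the quantity one would compare against a tolerance such as the 3% figure quoted above.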
LG Feb 5, 2024
FuseMoE: Mixture-of-Experts Transformers for Fleximodal Fusion
Xing Han, Huy Nguyen, Carl Harris et al.
As machine learning models in critical fields increasingly grapple with multimodal data, they face the dual challenges of handling a wide array of modalities, often incomplete due to missing elements, and the temporal irregularity and sparsity of collected samples. Successfully leveraging this complex data, while overcoming the scarcity of high-quality training samples, is key to improving these models' predictive performance. We introduce "FuseMoE", a mixture-of-experts framework with an innovative gating function. Designed to integrate a diverse number of modalities, FuseMoE is effective in managing scenarios with missing modalities and irregularly sampled data trajectories. Theoretically, our unique gating function contributes to enhanced convergence rates, leading to better performance in multiple downstream tasks. The practical utility of FuseMoE in the real world is validated by a diverse set of challenging prediction tasks.
LG May 10, 2024
Conformal Validity Guarantees Exist for Any Data Distribution (and How to Find Them)
Drew Prinster, Samuel Stanton, Anqi Liu et al.
As artificial intelligence (AI) / machine learning (ML) gain widespread adoption, practitioners are increasingly seeking means to quantify and control the risk these systems incur. This challenge is especially salient when such systems have autonomy to collect their own data, such as in black-box optimization and active learning, where their actions induce sequential feedback-loop shifts in the data distribution. Conformal prediction is a promising approach to uncertainty and risk quantification, but prior variants' validity guarantees have assumed some form of "quasi-exchangeability" on the data distribution, thereby excluding many types of sequential shifts. In this paper we prove that conformal prediction can theoretically be extended to any joint data distribution, not just exchangeable or quasi-exchangeable ones. Although the most general case is exceedingly impractical to compute, for concrete practical applications we outline a procedure for deriving specific conformal algorithms for any data distribution, and we use this procedure to derive tractable algorithms for a series of AI/ML-agent-induced covariate shifts. We evaluate the proposed algorithms empirically on synthetic black-box optimization and active learning tasks.
CL Mar 5, 2024
The Minimum Information about CLinical Artificial Intelligence Checklist for Generative Modeling Research (MI-CLAIM-GEN)
Brenda Y. Miao, Irene Y. Chen, Christopher YK Williams et al.
Recent advances in generative models, including large language models (LLMs), vision language models (VLMs), and diffusion models, have accelerated the field of natural language and image processing in medicine and marked a significant paradigm shift in how biomedical models can be developed and deployed. While these models are highly adaptable to new tasks, scaling and evaluating their usage presents new challenges not addressed in previous frameworks. In particular, the ability of these models to produce useful outputs with little to no specialized training data ("zero-" or "few-shot" approaches), as well as the open-ended nature of their outputs, necessitate the development of new guidelines for robust reporting of clinical generative model research. In response to gaps in standards and best practices for the development of clinical AI tools identified by US Executive Order 14110 and several emerging national networks for clinical AI evaluation, we begin to formalize some of these guidelines by building on the original MI-CLAIM checklist. The new checklist, MI-CLAIM-GEN (Table 1), aims to address differences in training, evaluation, interpretability, and reproducibility of new generative models compared to non-generative ("predictive") AI models. This MI-CLAIM-GEN checklist also seeks to clarify cohort selection reporting with unstructured clinical data and adds items on alignment with ethical standards for clinical AI research.
LG May 7, 2025
WATCH: Adaptive Monitoring for AI Deployments via Weighted-Conformal Martingales
Drew Prinster, Xing Han, Anqi Liu et al.
Responsibly deploying artificial intelligence (AI) / machine learning (ML) systems in high-stakes settings arguably requires not only proof of system reliability, but also continual, post-deployment monitoring to quickly detect and address any unsafe behavior. Methods for nonparametric sequential testing -- especially conformal test martingales (CTMs) and anytime-valid inference -- offer promising tools for this monitoring task. However, existing approaches are restricted to monitoring limited hypothesis classes or "alarm criteria" (e.g., detecting data shifts that violate certain exchangeability or IID assumptions), do not allow for online adaptation in response to shifts, and/or cannot diagnose the cause of degradation or alarm. In this paper, we address these limitations by proposing a weighted generalization of conformal test martingales (WCTMs), which lay a theoretical foundation for online monitoring for any unexpected changepoints in the data distribution while controlling false alarms. For practical applications, we propose specific WCTM algorithms that adapt online to mild covariate shifts (in the marginal input distribution), quickly detect harmful shifts, and diagnose those harmful shifts as concept shifts (in the conditional label distribution) or extreme (out-of-support) covariate shifts that cannot be easily adapted to. On real-world datasets, we demonstrate improved performance relative to state-of-the-art baselines.
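For readers unfamiliar with conformal test martingales, the sketch below shows the standard unweighted "power" martingale, which is the kind of baseline this abstract generalizes. It is not the WCTM algorithm from the paper. The conformal p-values are assumed to be given and strictly positive, and the function names, the betting parameter `epsilon`, and the toy p-value sequences are illustrative assumptions.

```python
import math

def power_martingale(p_values, epsilon=0.5):
    """Running wealth M_n = prod_i epsilon * p_i**(epsilon - 1), in log space.

    The betting function epsilon * p**(epsilon - 1) integrates to 1 on [0, 1],
    so the wealth is a test martingale when the p-values are IID uniform.
    Assumes every p-value is strictly positive.
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    log_wealth = 0.0
    path = []
    for p in p_values:
        log_wealth += math.log(epsilon) + (epsilon - 1.0) * math.log(p)
        path.append(math.exp(log_wealth))
    return path

def first_alarm(path, alpha=0.05):
    """Index of the first step where wealth reaches 1/alpha, or None."""
    threshold = 1.0 / alpha
    for step, wealth in enumerate(path):
        if wealth >= threshold:
            return step
    return None

# Made-up p-value streams: unremarkable values vs. persistently small values.
in_control = [0.9, 0.4, 0.7, 0.55, 0.8]
shifted = [0.01, 0.02, 0.005, 0.01, 0.03]

alarm_in_control = first_alarm(power_martingale(in_control))
alarm_shifted = first_alarm(power_martingale(shifted))
```

By Ville's inequality, raising an alarm when the wealth reaches 1/alpha keeps the false-alarm probability at or below alpha over the whole monitoring horizon, which is the sense in which such monitors control false alarms.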
LG Apr 10, 2025
Between Linear and Sinusoidal: Rethinking the Time Encoder in Dynamic Graph Learning
Hsing-Huan Chung, Shravan Chaudhari, Xing Han et al.
Dynamic graph learning is essential for applications involving temporal networks and requires effective modeling of temporal relationships. Seminal attention-based models like TGAT and DyGFormer rely on sinusoidal time encoders to capture temporal dependencies between edge events. Prior work justified sinusoidal encodings because their inner products depend on the time spans between events, which are crucial features for modeling inter-event relations. However, sinusoidal encodings inherently lose temporal information due to their many-to-one nature and therefore require high dimensions. In this paper, we rigorously study a simpler alternative: the linear time encoder, which avoids temporal information loss caused by sinusoidal functions and reduces the need for high-dimensional time encoders. We show that the self-attention mechanism can effectively learn to compute time spans between events from linear time encodings and extract relevant temporal patterns. Through extensive experiments on six dynamic graph datasets, we demonstrate that the linear time encoder improves the performance of TGAT and DyGFormer in most cases. Moreover, the linear time encoder can lead to significant savings in model parameters with minimal performance loss. For example, compared to a 100-dimensional sinusoidal time encoder, TGAT with a 2-dimensional linear time encoder saves 43% of parameters and achieves higher average precision on five datasets. While both encoders can be used simultaneously, our study highlights the often-overlooked advantages of linear time features in modern dynamic graph models. These findings can positively impact the design choices of various dynamic graph learning architectures and eventually benefit temporal network applications such as recommender systems, communication networks, and traffic forecasting.
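Two properties mentioned in this abstract can be checked numerically: inner products of sinusoidal encodings depend only on the time span, and sinusoidal encodings are many-to-one. The sketch below uses hand-picked frequencies and a one-dimensional linear encoder whose `scale` and `shift` stand in for learned parameters; none of these choices come from TGAT, DyGFormer, or the paper.

```python
import math

def sinusoidal_encode(t, freqs):
    """Return [cos(w*t), sin(w*t)] concatenated over each frequency w."""
    out = []
    for w in freqs:
        out.extend([math.cos(w * t), math.sin(w * t)])
    return out

def linear_encode(t, scale=1.0, shift=0.0):
    """One-dimensional linear encoding; scale and shift are placeholders."""
    return [scale * t + shift]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

freqs = [1.0, 0.1]

# Inner products depend only on the time span (3.0 in both pairs), because
# cos(w*s)cos(w*t) + sin(w*s)sin(w*t) = cos(w*(s - t)).
ip_a = dot(sinusoidal_encode(2.0, freqs), sinusoidal_encode(5.0, freqs))
ip_b = dot(sinusoidal_encode(40.0, freqs), sinusoidal_encode(43.0, freqs))

# Many-to-one: with a single frequency w, times t and t + 2*pi/w collide.
period = 2 * math.pi / 1.0
collide_a = sinusoidal_encode(0.5, [1.0])
collide_b = sinusoidal_encode(0.5 + period, [1.0])

# The linear encoding keeps distinct times distinct, and a difference of
# encodings recovers the time span directly.
span = linear_encode(5.0)[0] - linear_encode(2.0)[0]
```

With a linear encoding the span is a simple difference of features, which gives some intuition for why an attention layer could learn to compute it, as the abstract claims.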
LG Sep 30, 2025
Guiding Mixture-of-Experts with Temporal Multimodal Interactions
Xing Han, Hsing-Huan Chung, Joydeep Ghosh et al.
Mixture-of-Experts (MoE) architectures have become pivotal for large-scale multimodal models. However, their routing mechanisms typically overlook the informative, time-varying interaction dynamics between modalities. This limitation hinders expert specialization, as the model cannot explicitly leverage intrinsic modality relationships for effective reasoning. To address this, we propose a novel framework that guides MoE routing using quantified temporal interaction. A multimodal interaction-aware router learns to dispatch tokens to experts based on the nature of their interactions. This dynamic routing encourages experts to acquire generalizable interaction-processing skills rather than merely learning task-specific features. Our framework builds on a new formulation of temporal multimodal interaction dynamics, which are used to guide expert routing. We first demonstrate that these temporal multimodal interactions reveal meaningful patterns across applications, and then show how they can be leveraged to improve both the design and performance of MoE-based models. Comprehensive experiments on challenging multimodal benchmarks validate our approach, demonstrating both enhanced performance and improved interpretability.
LG May 17, 2025
Improving Coverage in Combined Prediction Sets with Weighted p-values
Gina Wong, Drew Prinster, Suchi Saria et al.
Conformal prediction quantifies the uncertainty of machine learning models by augmenting point predictions with valid prediction sets. For complex scenarios involving multiple trials, models, or data sources, conformal prediction sets can be aggregated to create a prediction set that captures the overall uncertainty, often improving precision. However, aggregating multiple prediction sets with individual $1-α$ coverage inevitably weakens the overall guarantee, typically resulting in $1-2α$ worst-case coverage. In this work, we propose a framework for the weighted aggregation of prediction sets, where weights are assigned to each prediction set based on their contribution. Our framework offers flexible control over how the sets are aggregated, achieving tighter coverage bounds that interpolate between the $1-2α$ guarantee of the combined models and the $1-α$ guarantee of an individual model depending on the distribution of weights. Importantly, our framework generalizes to data-dependent weights, as we derive a procedure for weighted aggregation that maintains finite-sample validity even when the weights depend on the data. This extension makes our framework broadly applicable to settings where weights are learned, such as mixture-of-experts (MoE), and we demonstrate through experiments in the MoE setting that our methods achieve adaptive coverage.
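One simple way to see how weights can interpolate between the $1-2α$ and $1-α$ guarantees is a weighted majority vote over prediction sets. The sketch below is an illustrative stand-in, not the weighted p-value procedure from the paper, and it assumes fixed weights chosen independently of the data. With nonnegative weights summing to 1 and each set covering with probability at least $1-α$, Markov's inequality gives coverage of at least $1-2α$ for the voted set; if one set carries weight above 1/2, the voted set contains it and inherits its $1-α$ coverage.

```python
def weighted_majority_set(prediction_sets, weights, threshold=0.5):
    """Keep each label whose total weight over the sets containing it exceeds threshold."""
    if abs(sum(weights) - 1.0) > 1e-9 or min(weights) < 0:
        raise ValueError("weights must be nonnegative and sum to 1")
    candidates = set().union(*prediction_sets)
    combined = set()
    for label in candidates:
        # Total weight of the sets that voted for this label.
        vote = sum(w for s, w in zip(prediction_sets, weights) if label in s)
        if vote > threshold:
            combined.add(label)
    return combined

# Toy prediction sets from three hypothetical models.
sets = [{"a", "b"}, {"b", "c"}, {"b", "d"}]

# Uniform weights: only the label supported by a majority survives.
uniform = weighted_majority_set(sets, [1 / 3, 1 / 3, 1 / 3])

# Weight concentrated on the first model: the result contains its whole set.
concentrated = weighted_majority_set(sets, [0.8, 0.1, 0.1])
```

Handling weights that depend on the data, as in a learned mixture-of-experts router, requires the additional machinery the abstract refers to and is not covered by this sketch.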
OT Dec 22, 2021
Beyond Low Earth Orbit: Biomonitoring, Artificial Intelligence, and Precision Space Health
Ryan T. Scott, Erik L. Antonsen, Lauren M. Sanders et al.
Human space exploration beyond low Earth orbit will involve missions of significant distance and duration. To effectively mitigate myriad space health hazards, paradigm shifts in data and space health systems are necessary to enable Earth-independence, rather than Earth-reliance. Promising developments in the fields of artificial intelligence and machine learning for biology and health can address these needs. We propose an appropriately autonomous and intelligent Precision Space Health system that will monitor, aggregate, and assess biomedical statuses; analyze and predict personalized adverse health outcomes; adapt and respond to newly accumulated data; and provide preventive, actionable, and timely insights to individual deep space crew members and iterative decision support to their crew medical officer. Here we present a summary of recommendations from a workshop organized by the National Aeronautics and Space Administration, on future applications of artificial intelligence in space biology and health. In the next decade, biomonitoring technology, biomarker science, spacecraft hardware, intelligent software, and streamlined data management must mature and be woven together into a Precision Space Health system to enable humanity to thrive in deep space.
OT Dec 22, 2021
Beyond Low Earth Orbit: Biological Research, Artificial Intelligence, and Self-Driving Labs
Lauren M. Sanders, Jason H. Yang, Ryan T. Scott et al.
Space biology research aims to understand fundamental effects of spaceflight on organisms, develop foundational knowledge to support deep space exploration, and ultimately bioengineer spacecraft and habitats to stabilize the ecosystem of plants, crops, microbes, animals, and humans for sustained multi-planetary life. To advance these aims, the field leverages experiments, platforms, data, and model organisms from both spaceborne and ground-analog studies. As research is extended beyond low Earth orbit, experiments and platforms must be maximally autonomous, light, agile, and intelligent to expedite knowledge discovery. Here we present a summary of recommendations from a workshop organized by the National Aeronautics and Space Administration on artificial intelligence, machine learning, and modeling applications which offer key solutions toward these space biology challenges. In the next decade, the synthesis of artificial intelligence into the field of space biology will deepen the biological understanding of spaceflight effects, facilitate predictive modeling and analytics, support maximally autonomous and reproducible experiments, and efficiently manage spaceborne data and metadata, all with the goal to enable life to thrive in deep space.
ML Dec 23, 2020
Partial Identifiability in Discrete Data With Measurement Error
Noam Finkelstein, Roy Adams, Suchi Saria et al.
When data contains measurement errors, it is necessary to make assumptions relating the observed, erroneous data to the unobserved true phenomena of interest. These assumptions should be justifiable on substantive grounds, but are often motivated by mathematical convenience, for the sake of exactly identifying the target of inference. We adopt the view that it is preferable to present bounds under justifiable assumptions than to pursue exact identification under dubious ones. To that end, we demonstrate how a broad class of modeling assumptions involving discrete variables, including common measurement error and conditional independence assumptions, can be expressed as linear constraints on the parameters of the model. We then use linear programming techniques to produce sharp bounds for factual and counterfactual distributions under measurement error in such models. We additionally propose a procedure for obtaining outer bounds on non-linear models. Our method yields sharp bounds in a number of important settings -- such as the instrumental variable scenario with measurement error -- for which no bounds were previously known.
LG Oct 28, 2020
Evaluating Model Robustness and Stability to Dataset Shift
Adarsh Subbaswamy, Roy Adams, Suchi Saria
As the use of machine learning in high impact domains becomes widespread, the importance of evaluating safety has increased. An important aspect of this is evaluating how robust a model is to changes in setting or population, which typically requires applying the model to multiple, independent datasets. Since the cost of collecting such datasets is often prohibitive, in this paper, we propose a framework for analyzing this type of stability using the available data. We use the original evaluation data to determine distributions under which the algorithm performs poorly, and estimate the algorithm's performance on the "worst-case" distribution. We consider shifts in user defined conditional distributions, allowing some distributions to shift while keeping other portions of the data distribution fixed. For example, in a healthcare context, this allows us to consider shifts in clinical practice while keeping the patient population fixed. To address the challenges associated with estimation in complex, high-dimensional distributions, we derive a "debiased" estimator which maintains $\sqrt{N}$-consistency even when machine learning methods with slower convergence rates are used to estimate the nuisance parameters. In experiments on a real medical risk prediction task, we show this estimator can be used to analyze stability and accounts for realistic shifts that could not previously be expressed. The proposed framework allows practitioners to proactively evaluate the safety of their models without requiring additional data collection.
ML Feb 20, 2020
I-SPEC: An End-to-End Framework for Learning Transportable, Shift-Stable Models
Adarsh Subbaswamy, Suchi Saria
Shifts in environment between development and deployment cause classical supervised learning to produce models that fail to generalize well to new target distributions. Recently, many solutions which find invariant predictive distributions have been developed. Among these, graph-based approaches do not require data from the target environment and can capture more stable information than alternative methods which find stable feature sets. However, these approaches assume that the data generating process is known in the form of a full causal graph, which is generally not the case. In this paper, we propose I-SPEC, an end-to-end framework that addresses this shortcoming by using data to learn a partial ancestral graph (PAG). Using the PAG we develop an algorithm that determines an interventional distribution that is stable to the declared shifts; this subsumes existing approaches which find stable feature sets that are less accurate. We apply I-SPEC to a mortality prediction problem to show it can learn a model that is robust to shifts without needing upfront knowledge of the full causal DAG.
ML May 27, 2019
A Unifying Causal Framework for Analyzing Dataset Shift-stable Learning Algorithms
Adarsh Subbaswamy, Bryant Chen, Suchi Saria
Recent interest in the external validity of prediction models (i.e., the problem of different train and test distributions, known as dataset shift) has produced many methods for finding predictive distributions that are invariant to dataset shifts and can be used for prediction in new, unseen environments. However, these methods consider different types of shifts and have been developed under disparate frameworks, making it difficult to theoretically analyze how solutions differ with respect to stability and accuracy. Taking a causal graphical view, we use a flexible graphical representation to express various types of dataset shifts. Given a known graph of the data generating process, we show that all invariant distributions correspond to a causal hierarchy of graphical operators which disable the edges in the graph that are responsible for the shifts. The hierarchy provides a common theoretical underpinning for understanding when and how stability to shifts can be achieved, and in what ways stable distributions can differ. We use it to establish conditions for minimax optimal performance across environments, and derive new algorithms that find optimal stable distributions. Using this new perspective, we empirically demonstrate that there is a tradeoff between minimax and average performance.
LG Apr 15, 2019
Tutorial: Safe and Reliable Machine Learning
Suchi Saria, Adarsh Subbaswamy
This document serves as a brief overview of the "Safe and Reliable Machine Learning" tutorial given at the 2019 ACM Conference on Fairness, Accountability, and Transparency (FAT* 2019). The talk slides can be found here: https://bit.ly/2Gfsukp, while a video of the talk is available here: https://youtu.be/FGLOCkC4KmE, and a complete list of references for the tutorial here: https://bit.ly/2GdLPme.
ML Apr 10, 2019
Active Learning for Decision-Making from Imbalanced Observational Data
Iiris Sundin, Peter Schulam, Eero Siivola et al.
Machine learning can help personalized decision support by learning models to predict individual treatment effects (ITE). This work studies the reliability of prediction-based decision-making in a task of deciding which action $a$ to take for a target unit after observing its covariates $\tilde{x}$ and predicted outcomes $\hat{p}(\tilde{y} \mid \tilde{x}, a)$. An example case is personalized medicine and the decision of which treatment to give to a patient. A common problem when learning these models from observational data is imbalance, that is, difference in treated/control covariate distributions, which is known to increase the upper bound of the expected ITE estimation error. We propose to assess the decision-making reliability by estimating the ITE model's Type S error rate, which is the probability of the model inferring the sign of the treatment effect wrong. Furthermore, we use the estimated reliability as a criterion for active learning, in order to collect new (possibly expensive) observations, instead of making a forced choice based on unreliable predictions. We demonstrate the effectiveness of this decision-making aware active learning in two decision-making tasks: in simulated data with binary outcomes and in a medical dataset with synthetic and continuous treatment outcomes.
ML Jan 25, 2019
Learning Models from Data with Measurement Error: Tackling Underreporting
Roy Adams, Yuelong Ji, Xiaobin Wang et al.
Measurement error in observational datasets can lead to systematic bias in inferences based on these datasets. As studies based on observational data are increasingly used to inform decisions with real-world impact, it is critical that we develop a robust set of techniques for analyzing and adjusting for these biases. In this paper we present a method for estimating the distribution of an outcome given a binary exposure that is subject to underreporting. Our method is based on a missing data view of the measurement error problem, where the true exposure is treated as a latent variable that is marginalized out of a joint model. We prove three different conditions under which the outcome distribution can still be identified from data containing only error-prone observations of the exposure. We demonstrate this method on synthetic data and analyze its sensitivity to near violations of the identifiability conditions. Finally, we use this method to estimate the effects of maternal smoking and opioid use during pregnancy on childhood obesity, two important problems in public health. Using the proposed method, we estimate these effects using only subject-reported drug use data and substantially refine the range of estimates generated by a sensitivity analysis-based approach. Further, the estimates produced by our method are consistent with existing literature on both the effects of maternal smoking and the rate at which subjects underreport smoking.
ML · Jan 2, 2019
Can You Trust This Prediction? Auditing Pointwise Reliability After Learning
Peter Schulam, Suchi Saria
To use machine learning in high stakes applications (e.g. medicine), we need tools for building confidence in the system and evaluating whether it is reliable. Methods to improve model reliability often require new learning algorithms (e.g. using Bayesian inference to obtain uncertainty estimates). An alternative is to audit a model after it is trained. In this paper, we describe resampling uncertainty estimation (RUE), an algorithm to audit the pointwise reliability of predictions. Intuitively, RUE estimates the amount that a prediction would change if the model had been fit on different training data. The algorithm uses the gradient and Hessian of the model's loss function to create an ensemble of predictions. Experimentally, we show that RUE more effectively detects inaccurate predictions than existing tools for auditing reliability subsequent to training. We also show that RUE can create predictive distributions that are competitive with state-of-the-art methods like Monte Carlo dropout, probabilistic backpropagation, and deep ensembles, but does not depend on specific algorithms at train-time like these methods do.
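To make the idea of building an ensemble of predictions from the gradient and Hessian concrete, here is a minimal sketch for ordinary least squares. It approximates bootstrap refitting with a single Newton-style step from the fitted parameters. This is an assumed simplification for illustration, not the RUE algorithm as published, and the name `resampled_predictions` is hypothetical.

```python
import numpy as np

def resampled_predictions(X, y, x_new, n_resamples=200, seed=0):
    """Approximate how a least-squares prediction varies under resampling.

    Each bootstrap refit is replaced by one Newton-style step that uses the
    per-example gradients and the Hessian of the summed squared-error loss.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # fitted parameters
    residuals = X @ theta - y
    grads = X * residuals[:, None]                  # per-example gradients, shape (n, d)
    hessian = X.T @ X                               # Hessian of the summed loss
    preds = np.empty(n_resamples)
    for b in range(n_resamples):
        # Bootstrap counts: how many times each example is drawn.
        w = rng.multinomial(n, np.full(n, 1.0 / n))
        # Gradient of the reweighted loss at theta, mapped through the inverse Hessian.
        step = np.linalg.solve(hessian, grads.T @ w)
        preds[b] = x_new @ (theta - step)
    return preds
```

The spread of the returned values (for example their standard deviation) can serve as a pointwise reliability score under these assumptions: a wide spread suggests the prediction depends heavily on which training examples happened to be observed.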
ML · Dec 11, 2018
Preventing Failures Due to Dataset Shift: Learning Predictive Models That Transport
Adarsh Subbaswamy, Peter Schulam, Suchi Saria
Classical supervised learning produces unreliable models when training and target distributions differ, with most existing solutions requiring samples from the target domain. We propose a proactive approach which learns a relationship in the training domain that will generalize to the target domain by incorporating prior knowledge of aspects of the data generating process that are expected to differ, as expressed in a causal selection diagram. Specifically, we remove variables generated by unstable mechanisms from the joint factorization to yield the Surgery Estimator, an interventional distribution that is invariant to the differences across environments. We prove that the Surgery Estimator finds stable relationships in strictly more scenarios than previous approaches which only consider conditional relationships, and demonstrate this in simulated experiments. We also evaluate on real-world data for which the true causal diagram is unknown, performing competitively against entirely data-driven approaches.
ML · Oct 6, 2018
Discretizing Logged Interaction Data Biases Learning for Decision-Making
Peter Schulam, Suchi Saria
Time series data that are not measured at regular intervals are commonly discretized as a preprocessing step. For example, data about customer arrival times might be simplified by summing the number of arrivals within hourly intervals, which produces a discrete-time time series that is easier to model. In this abstract, we show that discretization introduces a bias that affects models trained for decision-making. We refer to this phenomenon as discretization bias, and show that we can avoid it by using continuous-time models instead.
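The hourly-count preprocessing described above can be sketched as follows. The function name `hourly_counts` and the sample arrival times are made up for illustration. The snippet only shows the discretization step itself, not the resulting bias or the continuous-time alternative the abstract advocates.

```python
import numpy as np

def hourly_counts(arrival_hours, n_hours):
    """Sum irregularly timed arrivals into hourly bins.

    `arrival_hours` holds arrival times measured in hours from the start
    of the log; the exact timing within each hour is discarded.
    """
    edges = np.arange(n_hours + 1)                  # bin edges 0, 1, ..., n_hours
    counts, _ = np.histogram(arrival_hours, bins=edges)
    return counts

counts = hourly_counts([0.1, 0.9, 1.5, 3.2, 3.3, 3.99], n_hours=4)
```

After this step, two arrivals at 3.2 and 3.3 hours are indistinguishable from two arrivals at 3.0 and 3.9 hours, which is the kind of information loss a continuous-time model avoids.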
ML · Aug 9, 2018
Counterfactual Normalization: Proactively Addressing Dataset Shift and Improving Reliability Using Causal Mechanisms
Adarsh Subbaswamy, Suchi Saria
Predictive models can fail to generalize from training to deployment environments because of dataset shift, posing a threat to model reliability and the safety of downstream decisions made in practice. Instead of using samples from the target distribution to reactively correct dataset shift, we use graphical knowledge of the causal mechanisms relating variables in a prediction problem to proactively remove relationships that do not generalize across environments, even when these relationships may depend on unobserved variables (violations of the "no unobserved confounders" assumption). To accomplish this, we identify variables with unstable paths of statistical influence and remove them from the model. We also augment the causal graph with latent counterfactual variables that isolate unstable paths of statistical influence, allowing us to retain stable paths that would otherwise be removed. Our experiments demonstrate that models that remove vulnerable variables and use estimates of the latent variables transfer better, often outperforming in the target domain despite some accuracy loss in the training domain.
ML · Aug 16, 2017
Scalable Joint Models for Reliable Uncertainty-Aware Event Prediction
Hossein Soleimani, James Hensman, Suchi Saria
Missing data and noisy observations pose significant challenges for reliably predicting events from irregularly sampled multivariate time series (longitudinal) data. Imputation methods, which are typically used for completing the data prior to event prediction, lack a principled mechanism to account for the uncertainty due to missingness. Alternatively, state-of-the-art joint modeling techniques can be used to jointly model the longitudinal and event data and to compute event probabilities conditioned on the longitudinal observations. These approaches, however, make strong parametric assumptions and do not easily scale to multivariate signals with many observations. Our proposed approach consists of several key innovations. First, we develop a flexible and scalable joint model based upon sparse multiple-output Gaussian processes. Unlike state-of-the-art joint models, the proposed model can explain highly challenging structure including non-Gaussian noise while scaling to large data. Second, we derive an optimal policy for predicting events using the distribution of the event occurrence estimated by the joint model. The derived policy trades off the cost of a delayed detection against the cost of incorrect assessments and abstains from making decisions when the estimated event probability does not satisfy the derived confidence criteria. Experiments on a large dataset show that the proposed framework significantly outperforms state-of-the-art techniques in event prediction.
ML · Apr 6, 2017
Treatment-Response Models for Counterfactual Reasoning with Continuous-time, Continuous-valued Interventions
Hossein Soleimani, Adarsh Subbaswamy, Suchi Saria
Treatment effects can be estimated from observational data as the difference in potential outcomes. In this paper, we address the challenge of estimating the potential outcome when treatment-dose levels can vary continuously over time. Further, the outcome variable may not be measured at a regular frequency. Our proposed solution represents the treatment response curves using linear time-invariant dynamical systems, which provide a flexible means for modeling response over time to highly variable dose curves. Moreover, for multivariate data, the proposed method uncovers shared structure in treatment response and the baseline across multiple markers, and flexibly models challenging correlation structure both across and within signals over time. For this, we build upon the framework of multiple-output Gaussian processes. On simulated data and a challenging clinical dataset, we show significant gains in accuracy over state-of-the-art models.
ML · Mar 30, 2017
Reliable Decision Support using Counterfactual Models
Peter Schulam, Suchi Saria
Decision-makers are faced with the challenge of estimating what is likely to happen when they take an action. For instance, if I choose not to treat this patient, are they likely to die? Practitioners commonly use supervised learning algorithms to fit predictive models that help decision-makers reason about likely future outcomes, but we show that this approach is unreliable, and sometimes even dangerous. The key issue is that supervised learning algorithms are highly sensitive to the policy used to choose actions in the training data, which causes the model to capture relationships that do not generalize. We propose using a different learning objective that predicts counterfactuals instead of predicting outcomes under an existing action policy as in supervised learning. To support decision-making in temporal settings, we introduce the Counterfactual Gaussian Process (CGP) to predict the counterfactual future progression of continuous-time trajectories under sequences of future actions. We demonstrate the benefits of the CGP on two important decision-support tasks: risk prediction and "what if?" reasoning for individualized treatment planning.
LG · Aug 18, 2016
A Bayesian Nonparametric Approach for Estimating Individualized Treatment-Response Curves
Yanbo Xu, Yanxun Xu, Suchi Saria
We study the problem of estimating the continuous response over time to interventions using observational time series: a retrospective dataset where the policy by which the data are generated is unknown to the learner. We are motivated by applications where response varies across individuals, so that estimating responses at the individual level is valuable for personalizing decision-making. We refer to this as the problem of estimating individualized treatment response (ITR) curves. In statistics, the G-computation formula (Robins, 1986) has been commonly used for estimating treatment responses from observational data containing sequential treatment assignments. However, past studies have focused predominantly on obtaining point-in-time estimates at the population level. We leverage the G-computation formula and develop a novel Bayesian nonparametric (BNP) method that can flexibly model functional data and provide posterior inference over the treatment response curves at both the individual and population level. On a challenging dataset containing time series from patients admitted to a hospital, we estimate responses to treatments used in managing kidney function and show that the resulting fits are more accurate than alternative approaches. Accurate methods for obtaining ITRs from observational data can dramatically accelerate the pace at which personalized treatment plans become possible.
ML · Apr 20, 2016
Trading-Off Cost of Deployment Versus Accuracy in Learning Predictive Models
Daniel P. Robinson, Suchi Saria
Predictive models are finding an increasing number of applications in many industries. As a result, a practical means for trading off the cost of deploying a model against its effectiveness is needed. Our work is motivated by risk prediction problems in healthcare. Cost structures in domains such as healthcare are quite complex, posing a significant challenge to existing approaches. We propose a novel framework for designing cost-sensitive structured regularizers that is suitable for problems with complex cost dependencies. We draw upon a surprising connection to boolean circuits. In particular, we represent the problem costs as a multi-layer boolean circuit, and then use properties of boolean circuits to define an extended feature vector and a group regularizer that exactly captures the underlying cost structure. The resulting regularizer may then be combined with a fidelity function to perform model prediction, for example. For the challenging real-world application of risk prediction for sepsis in intensive care units, the use of our regularizer leads to models that are in harmony with the underlying cost structure and thus provide an excellent tradeoff between prediction accuracy and cost.
ML · Jan 18, 2016
A Framework for Individualizing Predictions of Disease Trajectories by Exploiting Multi-Resolution Structure
Peter Schulam, Suchi Saria
For many complex diseases, there is a wide variety of ways in which an individual can manifest the disease. The challenge of personalized medicine is to develop tools that can accurately predict the trajectory of an individual's disease, which can in turn enable clinicians to optimize treatments. We represent an individual's disease trajectory as a continuous-valued continuous-time function describing the severity of the disease over time. We propose a hierarchical latent variable model that individualizes predictions of disease trajectories. This model shares statistical strength across observations at different resolutions: the population, subpopulation, and individual levels. We describe an algorithm for learning population and subpopulation parameters offline, and an online procedure for dynamically learning individual-specific parameters. Finally, we validate our model on the task of predicting the course of interstitial lung disease, a leading cause of death among patients with the autoimmune disease scleroderma. We compare our approach against the state of the art and demonstrate significant improvements in predictive accuracy.
CV · Dec 18, 2015
Deformable Distributed Multiple Detector Fusion for Multi-Person Tracking
Andy J Ma, Pong C Yuen, Suchi Saria
This paper addresses fully automated multi-person tracking in complex environments with challenging occlusion and extensive pose variations. Our solution combines multiple detectors for a set of different regions of interest (e.g., full-body and head) for multi-person tracking. The use of multiple detectors leads to fewer missed detections, as it is able to exploit the complementary strengths of the individual detectors. While the number of false positives may increase with the increased number of bounding boxes detected from multiple detectors, we propose to group the detection outputs by bounding box location and depth information. For robustness to significant pose variations, deformable spatial relationships between detectors are learnt in our multi-person tracking system. On RGBD data from a live Intensive Care Unit (ICU), we show that the proposed method significantly improves multi-person tracking performance over state-of-the-art methods.
AI · Jul 27, 2015
Learning (Predictive) Risk Scores in the Presence of Censoring due to Interventions
Kirill Dyagilev, Suchi Saria
A large and diverse set of measurements is regularly collected during a patient's hospital stay to monitor their health status. Tools for integrating these measurements into severity scores that accurately track changes in illness severity can improve clinicians' ability to provide timely interventions. Existing approaches for creating such scores either 1) rely on experts to fully specify the severity score, or 2) train a predictive score, using supervised learning, by regressing against a surrogate marker of severity such as the presence of downstream adverse events. The first approach does not extend to diseases where an accurate score cannot be elicited from experts. The second approach often produces scores that suffer from bias due to treatment-related censoring (Paxton, 2013). We propose a novel ranking-based framework for disease severity score learning (DSSL). DSSL exploits the following key observation: while it is challenging for experts to quantify the disease severity at any given time, it is often easy to compare the disease severity at two different times. Extending existing ranking algorithms, DSSL learns a function that maps a vector of a patient's measurements to a scalar severity score such that the resulting score is temporally smooth and consistent with the expert's ranking of pairs of disease states. We apply DSSL to the problem of learning a sepsis severity score using a large, real-world dataset. The learned scores significantly outperform state-of-the-art clinical scores in ranking patient states by severity and in early detection of future adverse events. We also show that the learned disease severity trajectories are consistent with clinical expectations of disease evolution. Further, using simulated datasets, we show that DSSL exhibits better generalization performance to changes in treatment patterns compared to the above approaches.
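The two requirements stated above, agreement with expert-ranked pairs and temporal smoothness, can be combined into a single objective. The sketch below does this for a linear score, using a squared hinge loss on ranked pairs and a squared rate-of-change penalty. The abstract does not give the exact DSSL objective, so the loss forms, the linear scoring function, and the name `ranking_smoothness_objective` are all assumptions for illustration.

```python
import numpy as np

def ranking_smoothness_objective(w, ordered_pairs, consecutive_pairs, smooth_weight=1.0):
    """Illustrative objective for a linear severity score s(x) = w @ x.

    ordered_pairs: (x_more_severe, x_less_severe) tuples ranked by an expert.
    consecutive_pairs: (x_t, x_next, dt) tuples from one patient's record,
        where dt is the elapsed time between the two measurements.
    """
    # Squared hinge: penalize pairs not separated by a margin of at least 1.
    rank_loss = sum(
        max(0.0, 1.0 - (w @ hi - w @ lo)) ** 2 for hi, lo in ordered_pairs
    )
    # Smoothness: penalize fast changes in the score over time.
    smooth_loss = sum(
        ((w @ b - w @ a) / dt) ** 2 for a, b, dt in consecutive_pairs
    )
    return rank_loss + smooth_weight * smooth_loss

w = np.array([1.0, 0.0])
pairs = [(np.array([2.0, 0.0]), np.array([0.0, 0.0]))]
steps = [(np.array([0.0, 0.0]), np.array([1.0, 0.0]), 1.0)]
value = ranking_smoothness_objective(w, pairs, steps)
```

Minimizing an objective of this shape over `w` with any standard optimizer would yield a score that respects the supplied comparisons while varying smoothly along each patient's trajectory.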
AI · Jun 20, 2012
Reasoning at the Right Time Granularity
Suchi Saria, Uri Nodelman, Daphne Koller
Most real-world dynamic systems are composed of different components that often evolve at very different rates. In traditional temporal graphical models, such as dynamic Bayesian networks, time is modeled at a fixed granularity, generally selected based on the rate at which the fastest component evolves. Inference must then be performed at this fastest granularity, potentially at significant computational cost. Continuous Time Bayesian Networks (CTBNs) avoid time-slicing in the representation by modeling the system as evolving continuously over time. The expectation-propagation (EP) inference algorithm of Nodelman et al. (2005) can then vary the inference granularity over time, but the granularity is uniform across all parts of the system, and must be selected in advance. In this paper, we provide a new EP algorithm that utilizes a general cluster graph architecture where clusters contain distributions that can overlap in both space (set of variables) and time. This architecture allows different parts of the system to be modeled at very different time granularities, according to their current rate of evolution. We also provide an information-theoretic criterion for dynamically re-partitioning the clusters during inference to tune the level of approximation to the current rate of evolution. This avoids the need to hand-select the appropriate granularity, and allows the granularity to adapt as information is transmitted across the network. We present experiments demonstrating that this approach can result in significant computational savings.