LG · Mar 19
Improving RCT-Based Treatment Effect Estimation Under Covariate Mismatch via Calibrated Alignment
Amir Asiaee, Samhita Pal
Randomized controlled trials (RCTs) are the gold standard for estimating heterogeneous treatment effects, yet they are often underpowered for detecting effect heterogeneity. Large observational studies (OS) can supplement RCTs for conditional average treatment effect (CATE) estimation, but a key barrier is covariate mismatch: the two sources measure different, only partially overlapping, covariates. We propose CALM (Calibrated ALignment under covariate Mismatch), which bypasses imputation by learning embeddings that map each source's features into a common representation space. OS outcome models are transferred to the RCT embedding space and calibrated using trial data, preserving causal identification from randomization. Finite-sample risk bounds decompose into alignment error, outcome-model complexity, and calibration complexity terms, identifying when embedding alignment outperforms imputation. Under the calibration-based linear variant, the framework provides protection against negative transfer; the neural variant can be vulnerable under severe distributional shift. Under sparse linear models, the embedding approach strictly generalizes imputation. Simulations across 51 settings confirm that (i) calibration-based methods are equivalent for linear CATEs, and (ii) the neural embedding variant wins all 22 nonlinear-regime settings with large margins.
LG · Feb 5
Fairness Under Group-Conditional Prior Probability Shift: Invariance, Drift, and Target-Aware Post-Processing
Amir Asiaee, Kaveh Aryan
Machine learning systems are often trained and evaluated for fairness on historical data, yet deployed in environments where conditions have shifted. A particularly common form of shift occurs when the prevalence of positive outcomes changes differently across demographic groups--for example, when disease rates rise faster in one population than another, or when economic conditions affect loan default rates unequally. We study group-conditional prior probability shift (GPPS), where the label prevalence $P(Y=1\mid A=a)$ may change between training and deployment while the feature-generation process $P(X\mid Y,A)$ remains stable. Our analysis yields three main contributions. First, we prove a fundamental dichotomy: fairness criteria based on error rates (equalized odds) are structurally invariant under GPPS, while acceptance-rate criteria (demographic parity) can drift--and we prove this drift is unavoidable for non-trivial classifiers (shift-robust impossibility). Second, we show that target-domain risk and fairness metrics are identifiable without target labels: the invariance of ROC quantities under GPPS enables consistent estimation from source labels and unlabeled target data alone, with finite-sample guarantees. Third, we propose TAP-GPPS, a label-free post-processing algorithm that estimates prevalences from unlabeled data, corrects posteriors, and selects thresholds to satisfy demographic parity in the target domain. Experiments validate our theoretical predictions and demonstrate that TAP-GPPS achieves target fairness with minimal utility loss.
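The "corrects posteriors" step can be illustrated with the standard prior-shift adjustment, which follows from Bayes' rule when $P(X\mid Y,A)$ is stable. This is a minimal sketch, not the TAP-GPPS implementation: the function name `correct_posterior` is hypothetical, and the target prevalence is assumed to have been estimated already (the abstract does not say which estimator is used).

```python
def correct_posterior(p_source: float, prev_source: float, prev_target: float) -> float:
    """Re-weight a source posterior P_s(Y=1 | x, a) to a new prevalence for group a.

    Assumes P(X | Y, A) is unchanged between domains, so only the class
    prior for the group differs. prev_target is assumed to be given.
    """
    if not (0.0 < prev_source < 1.0 and 0.0 < prev_target < 1.0):
        raise ValueError("prevalences must lie strictly between 0 and 1")
    # Scale each class's posterior mass by the ratio of new to old prior.
    pos = p_source * prev_target / prev_source
    neg = (1.0 - p_source) * (1.0 - prev_target) / (1.0 - prev_source)
    # Renormalize so the two class probabilities sum to one.
    return pos / (pos + neg)


# A score of 0.5 under a balanced source prior moves to the target prevalence.
print(correct_posterior(0.5, prev_source=0.5, prev_target=0.8))
```

Applying the function separately per group, each with its own pair of prevalences, is what makes the correction group-conditional in this sketch.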
LG · Mar 2
CausalWrap: Model-Agnostic Causal Constraint Wrappers for Tabular Synthetic Data
Amir Asiaee, Zhuohui J. Liang, Chao Yan
Tabular synthetic data generators are typically trained to match observational distributions, which can yield high conventional utility (e.g., column correlations, predictive accuracy) yet poor preservation of structural relations relevant to causal analysis and out-of-distribution (OOD) reasoning. When the downstream use of synthetic data involves causal reasoning -- estimating treatment effects, evaluating policies, or testing mediation pathways -- merely matching the observational distribution is insufficient: structural fidelity and treatment-mechanism preservation become essential. We propose CausalWrap (CW), a model-agnostic wrapper that injects partial causal knowledge (PCK) -- trusted edges, forbidden edges, and qualitative/monotonic constraints -- into any pretrained base generator (GAN, VAE, or diffusion model), without requiring access to its internals. CW learns a lightweight, differentiable post-hoc correction map applied to samples from the base generator, optimized with causal penalty terms under an augmented-Lagrangian schedule. We provide theoretical results connecting penalty-based optimization to constraint satisfaction and relating approximate factorization to joint distributional control. We validate CW on simulated structural causal models (SCMs) with known ground-truth interventions, semi-synthetic causal benchmarks (IHDP and an ACIC-style suite), and a real-world ICU cohort (MIMIC-IV) with expert-elicited partial graphs. CW improves causal fidelity across diverse base generators -- e.g., reducing average treatment effect (ATE) error by up to 63% on ACIC and lifting ATE agreement from 0.00 to 0.38 on the intensive care unit (ICU) cohort -- while largely retaining conventional utility.
LG · Feb 5
Projected Boosting with Fairness Constraints: Quantifying the Cost of Fair Training Distributions
Amir Asiaee, Kaveh Aryan
Boosting algorithms enjoy strong theoretical guarantees: when weak learners maintain positive edge, AdaBoost achieves geometric decrease of exponential loss. We study how to incorporate group fairness constraints into boosting while preserving analyzable training dynamics. Our approach, FairBoost, projects the ensemble-induced exponential-weights distribution onto a convex set of distributions satisfying fairness constraints (as a reweighting surrogate), then trains weak learners on this fair distribution. The key theoretical insight is that projecting the training distribution reduces the effective edge of weak learners by a quantity controlled by the KL-divergence of the projection. We prove an exponential-loss bound where the convergence rate depends on weak learner edge minus a "fairness cost" term $δ_t = \sqrt{\mathrm{KL}(w^t \| q^t)/2}$. This directly quantifies the accuracy-fairness tradeoff in boosting dynamics. Experiments on standard benchmarks validate the theoretical predictions and demonstrate competitive fairness-accuracy tradeoffs with stable training curves.
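The fairness cost term can be computed directly for discrete distributions over training examples. The sketch below is illustrative only: the helper names are hypothetical, and which of the two arguments is the projected distribution is not stated in the abstract, so the code simply takes two probability vectors `w` and `q`. By Pinsker's inequality, the quantity $\sqrt{\mathrm{KL}(w\|q)/2}$ upper-bounds the total variation distance between them, which the last line checks numerically.

```python
import math


def kl_divergence(w, q):
    """KL(w || q) in nats for two discrete distributions on the same support."""
    total = 0.0
    for wi, qi in zip(w, q):
        if wi > 0.0:
            if qi <= 0.0:
                return math.inf  # w puts mass where q has none
            total += wi * math.log(wi / qi)
    return total


def fairness_cost(w, q):
    """delta = sqrt(KL(w || q) / 2), the form given in the abstract."""
    return math.sqrt(max(kl_divergence(w, q), 0.0) / 2.0)


def total_variation(w, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(wi - qi) for wi, qi in zip(w, q))


w = [0.5, 0.5]
q = [0.9, 0.1]
print(fairness_cost(w, q), total_variation(w, q))
print(total_variation(w, q) <= fairness_cost(w, q))  # Pinsker's inequality
```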
LG · Feb 5
Fix Representation (Optimally) Before Fairness: Finite-Sample Shrinkage Population Correction and the True Price of Fairness Under Subpopulation Shift
Amir Asiaee, Kaveh Aryan
Machine learning practitioners frequently observe tension between predictive accuracy and group fairness constraints -- yet sometimes fairness interventions appear to improve accuracy. We show that both phenomena can be artifacts of training data that misrepresents subgroup proportions. Under subpopulation shift (stable within-group distributions, shifted group proportions), we establish: (i) full importance-weighted correction is asymptotically unbiased but finite-sample suboptimal; (ii) the optimal finite-sample correction is a shrinkage reweighting that interpolates between target and training mixtures; (iii) apparent "fairness helps accuracy" can arise from comparing fairness methods to an improperly-weighted baseline. We provide an actionable evaluation protocol: fix representation (optimally) before fairness -- compare fairness interventions against a shrinkage-corrected baseline to isolate the true, irreducible price of fairness. Experiments on synthetic and real-world benchmarks (Adult, COMPAS) validate our theoretical predictions and demonstrate that this protocol eliminates spurious tradeoffs, revealing the genuine fairness-utility frontier.
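One way to picture a shrinkage reweighting is as a convex combination of the training and target group proportions, converted into per-group sample weights. The sketch below assumes that parameterization; the name `shrinkage_weights` and the mixing parameter `lam` are hypothetical, and the abstract's rule for choosing the optimal amount of shrinkage is not reproduced here.

```python
def shrinkage_weights(train_props, target_props, lam):
    """Per-group sample weights for a mixture between training and target proportions.

    lam = 0 leaves the training data unweighted; lam = 1 gives full
    importance weighting (target proportion / training proportion).
    How to choose lam is outside this sketch.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    weights = {}
    for group, p_train in train_props.items():
        if p_train <= 0.0:
            raise ValueError(f"group {group!r} has no training mass")
        mixed = (1.0 - lam) * p_train + lam * target_props[group]
        weights[group] = mixed / p_train
    return weights


train = {"a": 0.8, "b": 0.2}
target = {"a": 0.5, "b": 0.5}
for lam in (0.0, 0.5, 1.0):
    print(lam, shrinkage_weights(train, target, lam))
```

Intermediate values of `lam` trade the bias of ignoring the shift against the variance of heavily up-weighting a small group, which is the tension the abstract describes.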
LG · Dec 31, 2025
Causal Discovery with Mixed Latent Confounding via Precision Decomposition
Amir Asiaee, Samhita Pal, James O'quinn et al.
We study causal discovery from observational data in linear Gaussian systems affected by mixed latent confounding, where some unobserved factors act broadly across many variables while others influence only small subsets. This setting is common in practice and poses a challenge for existing methods: differentiable and score-based DAG learners can misinterpret global latent effects as causal edges, while latent-variable graphical models recover only undirected structure. We propose DCL-DECOR, a modular, precision-led pipeline that separates these roles. The method first isolates pervasive latent effects by decomposing the observed precision matrix into a structured component and a low-rank component. The structured component corresponds to the conditional distribution after accounting for pervasive confounders and retains only local dependence induced by the causal graph and localized confounding. A correlated-noise DAG learner is then applied to this deconfounded representation to recover directed edges while modeling remaining structured error correlations, followed by a simple reconciliation step to enforce bow-freeness. We provide identifiability results that characterize the recoverable causal target under mixed confounding and show how the overall problem reduces to well-studied subproblems with modular guarantees. Synthetic experiments that vary the strength and dimensionality of pervasive confounding demonstrate consistent improvements in directed edge recovery over applying correlated-noise DAG learning directly to the confounded data.
LG · Mar 2
Partial Causal Structure Learning for Valid Selective Conformal Inference under Interventions
Amir Asiaee, Kaveh Aryan, James P. Long
Selective conformal prediction can yield substantially tighter uncertainty sets when we can identify calibration examples that are exchangeable with the test example. In interventional settings, such as perturbation experiments in genomics, exchangeability often holds only within subsets of interventions that leave a target variable "unaffected" (e.g., non-descendants of an intervened node in a causal graph). We study the practical regime where this invariance structure is unknown and must be learned from data. Our contributions are: (i) a contamination-robust conformal coverage theorem that quantifies how misclassification of "unaffected" calibration examples degrades coverage via an explicit function $g(δ,n)$ of the contamination fraction and calibration set size, providing a finite-sample lower bound that holds for arbitrary contaminating distributions; (ii) a task-driven partial causal learning formulation that estimates only the binary descendant indicators $Z_{a,i}=\mathbf{1}\{i\in\mathrm{desc}(a)\}$ needed for selective calibration, rather than the full causal graph; and (iii) algorithms for descendant discovery via perturbation intersection patterns (differentially affected variable set intersections across interventions), and for approximate distance-to-intervention estimation via local invariant causal prediction. We provide recovery conditions under which contamination is controlled. Experiments on synthetic linear structural equation models (SEMs) validate the bound: under controlled contamination up to $δ=0.30$, the corrected procedure maintains $\ge 0.95$ coverage while uncorrected selective CP degrades to $0.867$. A proof-of-concept on Replogle K562 CRISPR interference (CRISPRi) perturbation data demonstrates applicability to real genomic screens.
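Selective calibration can be sketched with ordinary split conformal prediction applied only to calibration examples flagged as unaffected. This is an assumption-laden illustration: the function names are hypothetical, the `unaffected` flags stand in for the learned descendant indicators, and the contamination correction $g(δ,n)$ from the abstract is not implemented because its form is not given here.

```python
import math


def conformal_threshold(scores, alpha):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score.

    Returns infinity when the calibration set is too small for the
    requested coverage level.
    """
    n = len(scores)
    if n == 0:
        return math.inf
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def selective_threshold(scores, unaffected, alpha):
    """Calibrate only on examples flagged as unaffected by the intervention."""
    selected = [s for s, keep in zip(scores, unaffected) if keep]
    return conformal_threshold(selected, alpha)


scores = [3.0, 10.0, 1.0, 2.0]
unaffected = [True, False, True, True]  # hypothetical learned indicators
print(selective_threshold(scores, unaffected, alpha=0.5))
```

A mislabeled flag lets a non-exchangeable score (such as the 10.0 above) into the selected set, which is the kind of contamination the coverage theorem is meant to account for.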
LG · Mar 2
Noise-Calibrated Inference from Differentially Private Sufficient Statistics in Exponential Families
Amir Asiaee, Samhita Pal
Many differentially private (DP) data release systems either output DP synthetic data and leave analysts to perform inference as usual, which can lead to severe miscalibration, or output a DP point estimate without a principled way to do uncertainty quantification. This paper develops a clean and tractable middle ground for exponential families: release only DP sufficient statistics, then perform noise-calibrated likelihood-based inference and optional parametric synthetic data generation as post-processing. Our contributions are: (1) a general recipe for approximate-DP release of clipped sufficient statistics under the Gaussian mechanism; (2) asymptotic normality, explicit variance inflation, and valid Wald-style confidence intervals for the plug-in DP MLE; (3) a noise-aware likelihood correction that is first-order equivalent to the plug-in but supports bootstrap-based intervals; and (4) a matching minimax lower bound showing the privacy distortion rate is unavoidable. The resulting theory yields concrete design rules and a practical pipeline for releasing DP synthetic data with principled uncertainty quantification, validated on three exponential families and real census data.
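The release step in contribution (1) can be sketched for the simplest case, a single clipped sum. This is not the paper's recipe: it uses the classical Gaussian-mechanism calibration $\sigma = \Delta\sqrt{2\ln(1.25/\delta)}/\varepsilon$, which is valid for $\varepsilon < 1$, and it assumes replace-one neighboring datasets so that the sensitivity of a sum clipped to $[-c, c]$ is $2c$. All function names are hypothetical.

```python
import math
import random


def gaussian_sigma(epsilon, delta, sensitivity):
    """Noise scale for the classical Gaussian mechanism (requires 0 < epsilon < 1)."""
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("this calibration assumes 0 < epsilon < 1 and 0 < delta < 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def clip(x, bound):
    """Clip a scalar to the interval [-bound, bound]."""
    return max(-bound, min(bound, x))


def dp_clipped_sum(data, bound, epsilon, delta, rng):
    """Release sum(clip(x)) plus Gaussian noise; returns (noisy_sum, sigma).

    Assumes replace-one neighbors, so the sensitivity is 2 * bound.
    """
    clipped_sum = sum(clip(x, bound) for x in data)
    sigma = gaussian_sigma(epsilon, delta, sensitivity=2.0 * bound)
    return clipped_sum + rng.gauss(0.0, sigma), sigma


rng = random.Random(0)
data = [0.5, -3.0, 2.0, 0.1]
noisy_sum, sigma = dp_clipped_sum(data, bound=1.0, epsilon=0.5, delta=1e-5, rng=rng)
# Plug-in mean estimate, treating the sample size as public (an assumption).
print(noisy_sum / len(data), sigma)
```

In this sketch the injected noise adds $\sigma^2/n^2$ to the variance of the plug-in mean, so an interval built from the sampling variance alone would be too narrow; that is the kind of miscalibration noise-aware inference is meant to avoid.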
LG · Oct 2, 2025
DAG DECORation: Continuous Optimization for Structure Learning under Hidden Confounding
Samhita Pal, James O'quinn, Kaveh Aryan et al.
We study structure learning for linear Gaussian SEMs in the presence of latent confounding. Existing continuous methods excel when errors are independent, while deconfounding-first pipelines rely on pervasive factor structure or nonlinearity. We propose DECOR, a single likelihood-based and fully differentiable estimator that jointly learns a DAG and a correlated noise model. Our theory gives simple sufficient conditions for global parameter identifiability: if the mixed graph is bow-free and the noise covariance has a uniform eigenvalue margin, then the map from $(B,\Omega)$ to the observational covariance is injective, so both the directed structure and the noise are uniquely determined. The estimator alternates a smooth-acyclic graph update with a convex noise update and can include a light bow complementarity penalty or a post hoc reconciliation step. On synthetic benchmarks that vary confounding density, graph density, latent rank, and dimension with $n<p$, DECOR matches or outperforms strong baselines and is especially robust when confounding is non-pervasive, while remaining competitive under pervasiveness.
LG · Feb 10
PRISM: Differentially Private Synthetic Data with Structure-Aware Budget Allocation for Prediction
Amir Asiaee, Chao Yan, Zachary B. Abrams et al.
Differential privacy (DP) provides a mathematical guarantee limiting what an adversary can learn about any individual from released data. However, achieving this protection typically requires adding noise, and noise can accumulate when many statistics are measured. Existing DP synthetic data methods treat all features symmetrically, spreading noise uniformly even when the data will serve a specific prediction task. We develop a prediction-centric approach operating in three regimes depending on available structural knowledge. In the causal regime, when the causal parents of $Y$ are known and distribution shift is expected, we target the parents for robustness. In the graphical regime, when a Bayesian network structure is available and the distribution is stable, the Markov blanket of $Y$ provides a sufficient feature set for optimal prediction. In the predictive regime, when no structural knowledge exists, we select features via differentially private methods without claiming to recover causal or graphical structure. We formalize this as PRISM, a mechanism that (i) identifies a predictive feature subset according to the appropriate regime, (ii) constructs targeted summary statistics, (iii) allocates budget to minimize an upper bound on prediction error, and (iv) synthesizes data via graphical-model inference. We prove end-to-end privacy guarantees and risk bounds. Empirically, task-aware allocation improves prediction accuracy compared to generic synthesizers. Under distribution shift, targeting causal parents achieves AUC $\approx 0.73$ while correlation-based selection collapses to chance ($\approx 0.49$).
LG · Feb 10
Risk-Equalized Differentially Private Synthetic Data: Protecting Outliers by Controlling Record-Level Influence
Amir Asiaee, Chao Yan, Zachary B. Abrams et al.
When synthetic data is released, some individuals are harder to protect than others. A patient with a rare disease combination or a transaction with unusual characteristics stands out from the crowd. Differential privacy provides worst-case guarantees, but empirical attacks -- particularly membership inference -- succeed far more often against such outliers, especially under moderate privacy budgets and with auxiliary information. This paper introduces risk-equalized DP synthesis, a framework that prioritizes protection for high-risk records by reducing their influence on the learned generator. The mechanism operates in two stages: first, a small privacy budget estimates each record's "outlierness"; second, a DP learning procedure weights each record inversely to its risk score. Under Gaussian mechanisms, a record's privacy loss is proportional to its influence on the output -- so deliberately shrinking outliers' contributions yields tighter per-instance privacy bounds for precisely those records that need them most. We prove end-to-end DP guarantees via composition and derive closed-form per-record bounds for the synthesis stage (the scoring stage adds a uniform per-record term). Experiments on simulated data with controlled outlier injection show that risk-weighting substantially reduces membership inference success against high-outlierness records; ablations confirm that targeting -- not random downweighting -- drives the improvement. On real-world benchmarks (Breast Cancer, Adult, German Credit), gains are dataset-dependent, highlighting the interplay between scorer quality and synthesis pipeline.
LG · Jul 16, 2018
Time Series Deinterleaving of DNS Traffic
Amir Asiaee, Hardik Goel, Shalini Ghosh et al.
Stream deinterleaving is an important problem with various applications in the cybersecurity domain. In this paper, we consider the specific problem of deinterleaving DNS data streams using machine-learning techniques, with the objective of automating the extraction of malware domain sequences. We first develop a generative model for user request generation and DNS stream interleaving. Based on this model, we evaluate various inference strategies for deinterleaving, including augmented HMMs and LSTMs, on synthetic datasets. Our results demonstrate that state-of-the-art LSTMs outperform more traditional augmented HMMs in this application domain.
ML · Jun 11, 2018
High Dimensional Data Enrichment: Interpretable, Fast, and Data-Efficient
Amir Asiaee, Samet Oymak, Kevin R. Coombes et al.
We consider the problem of multi-task learning in the high dimensional setting. In particular, we introduce an estimator and investigate its statistical and computational properties for the problem of multiple connected linear regressions known as Data Enrichment/Sharing. The between-tasks connections are captured by a cross-tasks common parameter, which gets refined by per-task individual parameters. Any convex function, e.g., norm, can characterize the structure of both common and individual parameters. We delineate the sample complexity of our estimator and provide a high probability non-asymptotic bound for estimation error of all parameters under a geometric condition. We show that the recovery of the common parameter benefits from all of the pooled samples. We propose an iterative estimation algorithm with a geometric convergence rate and supplement our theoretical analysis with experiments on synthetic data. Overall, we present a first thorough statistical and computational analysis of inference in the data-sharing model.