Jonas Peters

ML
h-index45
46papers
6,309citations
Novelty55%
AI Score60

46 Papers

85.6MLJun 4
Anchor PCA

Benedikt Seiter, Anya Fries, Julius von Kügelgen et al.

Principal component analysis (PCA) is one of the most widely used unsupervised dimension reduction techniques. We study PCA for data from multiple related domains. Since principal components generally differ across domains, one way to obtain a shared low-rank embedding is to perform PCA on the pooled data. However, this approach can focus on spurious directions that exhibit high variation in only a few domains. To find a robust embedding that still explains most variance in unseen but similar domains, we propose instead to focus on shared directions of variation. To this end, we introduce Anchor PCA which trades off overall explained variance with agreement between the shared and domain-specific low-rank embeddings. Anchor PCA amounts to PCA on a modified target matrix and thus can be solved efficiently. Moreover, we show that Anchor PCA recovers a maximal invariant subspace and admits a minimax reconstruction interpretation under bounded domain-specific covariance inflations. On simulated and real-world gas sensor data with temporal drift, we demonstrate, respectively, that Anchor PCA recovers the maximally invariant subspace and yields embeddings that explain more variance on unseen domains than the pooling baseline and a worst-case alternative. Taken together, these findings establish Anchor PCA as a promising approach to robust unsupervised dimension reduction from multi-domain data.

MLFeb 19
Anti-causal domain generalization: Leveraging unlabeled data

Sorawit Saengkyongam, Juan L. Gamella, Andrew C. Miller et al. · eth-zurich

The problem of domain generalization concerns learning predictive models that are robust to distribution shifts when deployed in new, previously unseen environments. Existing methods typically require labeled data from multiple training environments, limiting their applicability when labeled data are scarce. In this work, we study domain generalization in an anti-causal setting, where the outcome causes the observed covariates. Under this structure, environment perturbations that affect the covariates do not propagate to the outcome, which motivates regularizing the model's sensitivity to these perturbations. Crucially, estimating these perturbation directions does not require labels, enabling us to leverage unlabeled data from multiple environments. We propose two methods that penalize the model's sensitivity to variations in the mean and covariance of the covariates across environments, respectively, and prove that these methods have worst-case optimality guarantees under certain classes of environments. Finally, we demonstrate the empirical performance of our approach on a controlled physical system and a physiological signal dataset.

MLJun 19, 2023
Effect-Invariant Mechanisms for Policy Generalization

Sorawit Saengkyongam, Niklas Pfister, Predrag Klasnja et al.

Policy learning is an important component of many real-world learning systems. A major challenge in policy learning is how to adapt efficiently to unseen environments or tasks. Recently, it has been suggested to exploit invariant conditional distributions to learn models that generalize better to unseen environments. However, assuming invariance of entire conditional distributions (which we call full invariance) may be too strong of an assumption in practice. In this paper, we introduce a relaxation of full invariance called effect-invariance (e-invariance for short) and prove that it is sufficient, under suitable assumptions, for zero-shot policy generalization. We also discuss an extension that exploits e-invariance when we have a small sample from the test environment, enabling few-shot policy generalization. Our work does not assume an underlying causal graph or that the data are generated by a structural causal model; instead, we develop testing procedures to test e-invariance directly from data. We present empirical results using simulated data and a mobile health intervention dataset to demonstrate the effectiveness of our approach.

LGOct 6, 2023
Identifying Representations for Intervention Extrapolation

Sorawit Saengkyongam, Elan Rosenfeld, Pradeep Ravikumar et al.

The premise of identifiable and causal representation learning is to improve the current representation learning paradigm in terms of generalizability or robustness. Despite recent progress in questions of identifiability, more theoretical results demonstrating concrete advantages of these methods for downstream tasks are needed. In this paper, we consider the task of intervention extrapolation: predicting how interventions affect an outcome, even when those interventions are not observed at training time, and show that identifiable representations can provide an effective solution to this task even if the interventions affect the outcome non-linearly. Our setup includes an outcome Y, observed features X, which are generated as a non-linear transformation of latent features Z, and exogenous action variables A, which influence Z. The objective of intervention extrapolation is to predict how interventions on A that lie outside the training support of A affect Y. Here, extrapolation becomes possible if the effect of A on Z is linear and the residual when regressing Z on A has full support. As Z is latent, we combine the task of intervention extrapolation with identifiable representation learning, which we call Rep4Ex: we aim to map the observed features X into a subspace that allows for non-linear extrapolation in A. We show that the hidden representation is identifiable up to an affine transformation in Z-space, which is sufficient for intervention extrapolation. The identifiability is characterized by a novel constraint describing the linearity assumption of A on Z. Based on this insight, we propose a method that enforces the linear invariance constraint and can be combined with any type of autoencoder. We validate our theoretical findings through synthetic experiments and show that our approach succeeds in predicting the effects of unseen interventions.

MLJun 1, 2023
Unfair Utilities and First Steps Towards Improving Them

Frederik Hytting Jørgensen, Sebastian Weichwald, Jonas Peters

Many fairness criteria constrain the policy or choice of predictors, which can have unwanted consequences, in particular, when optimizing the policy under such constraints. Here, we advocate to instead focus on the utility function the policy is optimizing for. We define value of information fairness and propose to not use utility functions that violate this criterion. This principle suggests to modify these utility functions such that they satisfy value of information fairness. We describe how this can be done and discuss consequences for the corresponding optimal policies. We apply our framework to thought experiments and the COMPAS data. Focussing on the utility function provides better answers than existing fairness notions: We are not aware of any intuitively fair policy that is disallowed by value of information fairness, and when we find that value of information fairness recommends an intuitively unfair policy, no existing fairness notion finds an intuitively fair policy.

MLOct 9, 2023
Boosted Control Functions: Distribution generalization and invariance in confounded models

Nicola Gnecco, Jonas Peters, Sebastian Engelke et al.

Modern machine learning methods and the availability of large-scale data have significantly advanced our ability to predict target quantities from large sets of covariates. However, these methods often struggle under distributional shifts, particularly in the presence of hidden confounding. While the impact of hidden confounding is well-studied in causal effect estimation, e.g., instrumental variables, its implications for prediction tasks under shifting distributions remain underexplored. This work addresses this gap by introducing a strong notion of invariance that, unlike existing weaker notions, allows for distribution generalization even in the presence of nonlinear, non-identifiable structural functions. Central to this framework is the Boosted Control Function (BCF), a novel, identifiable target of inference that satisfies the proposed strong invariance notion and is provably worst-case optimal under distributional shifts. The theoretical foundation of our work lies in Simultaneous Equation Models for Distribution Generalization (SIMDGs), which bridge machine learning with econometrics by describing data-generating processes under distributional shifts. To put these insights into practice, we propose the ControlTwicing algorithm to estimate the BCF using nonparametric machine-learning techniques and study its generalization performance on synthetic and real-world datasets compared to robust and empirical risk minimization approaches.

AIApr 17, 2024Code
The Causal Chambers: Real Physical Systems as a Testbed for AI Methodology

Juan L. Gamella, Jonas Peters, Peter Bühlmann · eth-zurich

In some fields of AI, machine learning and statistics, the validation of new methods and algorithms is often hindered by the scarcity of suitable real-world datasets. Researchers must often turn to simulated data, which yields limited information about the applicability of the proposed methods to real problems. As a step forward, we have constructed two devices that allow us to quickly and inexpensively produce large datasets from non-trivial but well-understood physical systems. The devices, which we call causal chambers, are computer-controlled laboratories that allow us to manipulate and measure an array of variables from these physical systems, providing a rich testbed for algorithms from a variety of fields. We illustrate potential applications through a series of case studies in fields such as causal discovery, out-of-distribution generalization, change point detection, independent component analysis, and symbolic regression. For applications to causal inference, the chambers allow us to carefully perform interventions. We also provide and empirically validate a causal model of each chamber, which can be used as ground truth for different tasks. All hardware and software is made open source, and the datasets are publicly available at causalchamber.org or through the Python package causalchamber.

43.9LGMay 19
FLUXtrapolation: A benchmark on extrapolating ecosystem fluxes

Anya Fries, Jacob A Nelson, Martin Jung et al.

We introduce FLUXtrapolation, a benchmark for extrapolating ecosystem fluxes under progressively harder distribution shifts. Ecosystem fluxes are central to understanding the carbon, water, and energy cycles, yet they can only be measured directly at sparsely located measurement towers. Producing global flux estimates therefore requires training models on observed sites using globally available covariates and predicting in unobserved regions, that is, upscaling. Flux upscaling is a challenging domain generalization problem that is affected by a shift in covariate distribution across climates, ecosystem types, and environmental conditions, as well as by conditional shift: important drivers remain unobserved at global scale. We provide a quantitative analysis of both these shifts in $P_X$ and $P_{Y\mid X}$. FLUXtrapolation is designed based on domain expertise on flux upscaling: it defines temporal, spatial, and temperature-based extrapolation scenarios and evaluates performance across held-out domains, temporal aggregations, and tail errors. In a pilot study, we find that baselines perform similarly under median hourly RMSE, but separate under the proposed tail-focused and multi-scale evaluation. FLUXtrapolation therefore poses a realistic and thus relevant challenge for machine learning methods under distribution shift; at the same time, progress on this benchmark would directly support the scientific goal of improving flux upscaling.

MLJan 21
Many Experiments, Few Repetitions, Unpaired Data, and Sparse Effects: Is Causal Inference Possible?

Felix Schur, Niklas Pfister, Peng Ding et al.

We study the problem of estimating causal effects under hidden confounding in the following unpaired data setting: we observe some covariates $X$ and an outcome $Y$ under different experimental conditions (environments) but do not observe them jointly; we either observe $X$ or $Y$. Under appropriate regularity conditions, the problem can be cast as an instrumental variable (IV) regression with the environment acting as a (possibly high-dimensional) instrument. When there are many environments but only a few observations per environment, standard two-sample IV estimators fail to be consistent. We propose a GMM-type estimator based on cross-fold sample splitting of the instrument-covariate sample and prove that it is consistent as the number of environments grows but the sample size per environment remains constant. We further extend the method to sparse causal effects via $\ell_1$-regularized estimation and post-selection refitting.

MLMar 4
Invariance-Based Dynamic Regret Minimization

Margherita Lazzaretto, Jonas Peters, Niklas Pfister

We consider stochastic non-stationary linear bandits where the linear parameter connecting contexts to the reward changes over time. Existing algorithms in this setting localize the policy by gradually discarding or down-weighting past data, effectively shrinking the time horizon over which learning can occur. However, in many settings historical data may still carry partial information about the reward model. We propose to leverage such data while adapting to changes, by assuming the reward model decomposes into stationary and non-stationary components. Based on this assumption, we introduce ISD-linUCB, an algorithm that uses past data to learn invariances in the reward model and subsequently exploits them to improve online performance. We show both theoretically and empirically that leveraging invariance reduces the problem dimensionality, yielding significant regret improvements in fast-changing environments when sufficient historical data is available.

63.7MLMay 16
Prediction-Intervention Games and Invariant Sets

Linus Kühne, Felix Schur, Jonas Peters

We consider the following two-player game: using observational data, the leader chooses a prediction function for a response variable $Y$ from given covariates. The follower then reacts with an intervention on some covariates in the underlying structural causal model to maximize their own objective. The leader knows the intervention targets, but may have limited knowledge of the follower's objective. We call this setup a prediction-intervention game, a special case of a Stackelberg game. Finding an optimal strategy for the leader is generally difficult. To avoid severe performance loss, the leader may base their prediction on the causal parents of $Y$, or more generally on an invariant subset of covariates. We prove, for two common classes of follower objectives, that predictors based on the stable blanket, a specific invariant subset, are always better or as good as those based on the causal parents. We further upper bound the leader's post-intervention risk by a worst-case risk over allowed interventions and strengthen existing distribution generalization results to analyze this bound: we give sufficient conditions under which stable-blanket predictors are worst-case optimal, and show by examples that these conditions cannot in general be dropped. Finally, we discuss practical strategies for settings with known and unknown graph, and test them on simulated and real-world data.

LGFeb 12, 2022Code
Learning by Doing: Controlling a Dynamical System using Causality, Control, and Reinforcement Learning

Sebastian Weichwald, Søren Wengel Mogensen, Tabitha Edith Lee et al.

Questions in causality, control, and reinforcement learning go beyond the classical machine learning task of prediction under i.i.d. observations. Instead, these fields consider the problem of learning how to actively perturb a system to achieve a certain effect on a response variable. Arguably, they have complementary views on the problem: In control, one usually aims to first identify the system by excitation strategies to then apply model-based design techniques to control the system. In (non-model-based) reinforcement learning, one directly optimizes a reward. In causality, one focus is on identifiability of causal structure. We believe that combining the different views might create synergies and this competition is meant as a first step toward such synergies. The participants had access to observational and (offline) interventional data generated by dynamical systems. Track CHEM considers an open-loop problem in which a single impulse at the beginning of the dynamics can be set, while Track ROBO considers a closed-loop problem in which control variables can be set at each time step. The goal in both tracks is to infer controls that drive the system to a desired state. Code is open-sourced ( https://github.com/LearningByDoingCompetition/learningbydoing-comp ) to reproduce the winning solutions of the competition and to facilitate trying out new methods on the competition tasks.

MLJun 5, 2013Code
Structural Intervention Distance (SID) for Evaluating Causal Graphs

Jonas Peters, Peter Bühlmann

Causal inference relies on the structure of a graph, often a directed acyclic graph (DAG). Different graphs may result in different causal inference statements and different intervention distributions. To quantify such differences, we propose a (pre-) distance between DAGs, the structural intervention distance (SID). The SID is based on a graphical criterion only and quantifies the closeness between two DAGs in terms of their corresponding causal inference statements. It is therefore well-suited for evaluating graphs that are used for computing interventions. Instead of DAGs it is also possible to compare CPDAGs, completed partially directed acyclic graphs that represent Markov equivalence classes. Since it differs significantly from the popular Structural Hamming Distance (SHD), the SID constitutes a valuable additional measure. We discuss properties of this distance and provide an efficient implementation with software code available on the first author's homepage (an R package is under construction).

42.2MLMar 11
Worst-case low-rank approximations

Anya Fries, Markus Reichstein, David Blei et al.

Real-world data in health, economics, and environmental sciences are often collected across heterogeneous domains (such as hospitals, regions, or time periods). In such settings, distributional shifts can make standard PCA unreliable, in that, for example, the leading principal components may explain substantially less variance in unseen domains than in the training domains. Existing approaches (such as FairPCA) have proposed to consider worst-case (rather than average) performance across multiple domains. This work develops a unified framework, called wcPCA, applies it to other objectives (resulting in the novel estimators such as norm-minPCA and norm-maxregret, which are better suited for applications with heterogeneous total variance) and analyzes their relationship. We prove that for all objectives, the estimators are worst-case optimal not only over the observed source domains but also over all target domains whose covariance lies in the convex hull of the (possibly normalized) source covariances. We establish consistency and asymptotic worst-case guarantees of empirical estimators. We extend our methodology to matrix completion, another problem that makes use of low-rank approximations, and prove approximate worst-case optimality for inductive matrix completion. Simulations and two real-world applications on ecosystem-atmosphere fluxes demonstrate marked improvements in worst-case performance, with only minor losses in average performance.

LGOct 29, 2025
Transferring Causal Effects using Proxies

Manuel Iglesias-Alonso, Felix Schur, Julius von Kügelgen et al.

We consider the problem of estimating a causal effect in a multi-domain setting. The causal effect of interest is confounded by an unobserved confounder and can change between the different domains. We assume that we have access to a proxy of the hidden confounder and that all variables are discrete or categorical. We propose methodology to estimate the causal effect in the target domain, where we assume to observe only the proxy variable. Under these conditions, we prove identifiability (even when treatment and response variables are continuous). We introduce two estimation techniques, prove consistency, and derive confidence intervals. The theoretical results are supported by simulation studies and a real-world example studying the causal effect of website rankings on consumer choices.

MLApr 25, 2025
Representation Learning for Distributional Perturbation Extrapolation

Julius von Kügelgen, Jakob Ketterer, Xinwei Shen et al.

We consider the problem of modelling the effects of unseen perturbations such as gene knockdowns or drug combinations on low-level measurements such as RNA sequencing data. Specifically, given data collected under some perturbations, we aim to predict the distribution of measurements for new perturbations. To address this challenging extrapolation task, we posit that perturbations act additively in a suitable, unknown embedding space. More precisely, we formulate the generative process underlying the observed data as a latent variable model, in which perturbations amount to mean shifts in latent space and can be combined additively. Unlike previous work, we prove that, given sufficiently diverse training perturbations, the representation and perturbation effects are identifiable up to affine transformation, and use this to characterize the class of unseen perturbations for which we obtain extrapolation guarantees. To estimate the model from data, we propose a new method, the perturbation distribution autoencoder (PDAE), which is trained by maximising the distributional similarity between true and predicted perturbation distributions. The trained model can then be used to predict previously unseen perturbation distributions. Empirical evidence suggests that PDAE compares favourably to existing methods and baselines at predicting the effects of unseen perturbations.

MLDec 11, 2025
Maximum Risk Minimization with Random Forests

Francesco Freni, Anya Fries, Linus Kühne et al.

We consider a regression setting where observations are collected in different environments modeled by different data distributions. The field of out-of-distribution (OOD) generalization aims to design methods that generalize better to test environments whose distributions differ from those observed during training. One line of such works has proposed to minimize the maximum risk across environments, a principle that we refer to as MaxRM (Maximum Risk Minimization). In this work, we introduce variants of random forests based on the principle of MaxRM. We provide computationally efficient algorithms and prove statistical consistency for our primary method. Our proposed method can be used with each of the following three risks: the mean squared error, the negative reward (which relates to the explained variance), and the regret (which quantifies the excess risk relative to the best predictor). For MaxRM with regret as the risk, we prove a novel out-of-sample guarantee over unseen test distributions. Finally, we evaluate the proposed methods on both simulated and real-world data.

MLJun 11, 2024
DecoR: Deconfounding Time Series with Robust Regression

Felix Schur, Jonas Peters

Causal inference on time series data is a challenging problem, especially in the presence of unobserved confounders. This work focuses on estimating the causal effect between two time series that are confounded by a third, unobserved time series. Assuming spectral sparsity of the confounder, we show how in the frequency domain this problem can be framed as an adversarial outlier problem. We introduce Deconfounding by Robust regression (DecoR), a novel approach that estimates the causal effect using robust linear regression in the frequency domain. Considering two different robust regression techniques, we first improve existing bounds on the estimation error for such techniques. Crucially, our results do not require distributional assumptions on the covariates. We can therefore use them in time series settings. Applying these results to DecoR, we prove, under suitable assumptions, upper bounds for the estimation error of DecoR that imply consistency. We demonstrate DecoR's effectiveness through experiments on both synthetic and real-world data from Earth system science. The simulation experiments furthermore suggest that DecoR is robust with respect to model misspecification.

MLApr 15, 2024
Invariant Subspace Decomposition

Margherita Lazzaretto, Jonas Peters, Niklas Pfister

We consider the task of predicting a response Y from a set of covariates X in settings where the conditional distribution of Y given X changes over time. For this to be feasible, assumptions on how the conditional distribution changes over time are required. Existing approaches assume, for example, that changes occur smoothly over time so that short-term prediction using only the recent past becomes feasible. To additionally exploit observations further in the past, we propose a novel invariance-based framework for linear conditionals, called Invariant Subspace Decomposition (ISD), that splits the conditional distribution into a time-invariant and a residual time-dependent component. As we show, this decomposition can be utilized both for zero-shot and time-adaptation prediction tasks, that is, settings where either no or a small amount of training data is available at the time points we want to predict Y at, respectively. We propose a practical estimation procedure, which automatically infers the decomposition using tools from approximate joint matrix diagonalization. Furthermore, we provide finite sample guarantees for the proposed estimator and demonstrate empirically that it indeed improves on approaches that do not use the additional invariant structure.

MLFeb 3, 2022
Exploiting Independent Instruments: Identification and Distribution Generalization

Sorawit Saengkyongam, Leonard Henckel, Niklas Pfister et al.

Instrumental variable models allow us to identify a causal function between covariates $X$ and a response $Y$, even in the presence of unobserved confounding. Most of the existing estimators assume that the error term in the response $Y$ and the hidden confounders are uncorrelated with the instruments $Z$. This is often motivated by a graphical separation, an argument that also justifies independence. Positing an independence restriction, however, leads to strictly stronger identifiability results. We connect to the existing literature in econometrics and provide a practical method called HSIC-X for exploiting independence that can be combined with any gradient-based learning procedure. We see that even in identifiable settings, taking into account higher moments may yield better finite sample results. Furthermore, we exploit the independence for distribution generalization. We prove that the proposed estimator is invariant to distributional shifts on the instruments and worst-case optimal whenever these shifts are sufficiently strong. These results hold even in the under-identified case where the instruments are not sufficiently rich to identify the causal function.

MEFeb 2, 2022
Invariant Ancestry Search

Phillip B. Mogensen, Nikolaj Thams, Jonas Peters

Recently, methods have been proposed that exploit the invariance of prediction models with respect to changing environments to infer subsets of the causal parents of a response variable. If the environments influence only few of the underlying mechanisms, the subset identified by invariant causal prediction (ICP), for example, may be small, or even empty. We introduce the concept of minimal invariance and propose invariant ancestry search (IAS). In its population version, IAS outputs a set which contains only ancestors of the response and is a superset of the output of ICP. When applied to data, corresponding guarantees hold asymptotically if the underlying test for invariance has asymptotic level and power. We develop scalable algorithms and perform experiments on simulated and real data.

MLAug 19, 2021
Structure Learning for Directed Trees

Martin Emil Jakobsen, Rajen D. Shah, Peter Bühlmann et al.

Knowing the causal structure of a system is of fundamental interest in many areas of science and can aid the design of prediction algorithms that work well under manipulations to the system. The causal structure becomes identifiable from the observational distribution under certain restrictions. To learn the structure from data, score-based methods evaluate different graphs according to the quality of their fits. However, for large, continuous, and nonlinear models, these rely on heuristic optimization approaches with no general guarantees of recovering the true causal structure. In this paper, we consider structure learning of directed trees. We propose a fast and scalable method based on Chu-Liu-Edmonds' algorithm we call causal additive trees (CAT). For the case of Gaussian errors, we prove consistency in an asymptotic regime with a vanishing identifiability gap. We also introduce two methods for testing substructure hypotheses with asymptotic family-wise error rate control that is valid post-selection and in unidentified settings. Furthermore, we study the identifiability gap, which quantifies how much better the true causal model fits the observational distribution, and prove that it is lower bounded by local properties of the causal model. Simulation studies demonstrate the favorable performance of CAT compared to competing structure learning methods.

LGJun 1, 2021
Invariant Policy Learning: A Causal Perspective

Sorawit Saengkyongam, Nikolaj Thams, Jonas Peters et al.

Contextual bandit and reinforcement learning algorithms have been successfully used in various interactive learning systems such as online advertising, recommender systems, and dynamic pricing. However, they have yet to be widely adopted in high-stakes application domains, such as healthcare. One reason may be that existing approaches assume that the underlying mechanisms are static in the sense that they do not change over different environments. In many real-world systems, however, the mechanisms are subject to shifts across environments which may invalidate the static environment assumption. In this paper, we take a step toward tackling the problem of environmental shifts considering the framework of offline contextual bandits. We view the environmental shift problem through the lens of causality and propose multi-environment contextual bandits that allow for changes in the underlying mechanisms. We adopt the concept of invariance from the causality literature and introduce the notion of policy invariance. We argue that policy invariance is only relevant if unobserved variables are present and show that, in that case, an optimal invariant policy is guaranteed to generalize across environments under suitable assumptions. Our results establish concrete connections among causality, invariance, and contextual bandits.

LGMar 3, 2021
Regularizing towards Causal Invariance: Linear Models with Proxies

Michael Oberst, Nikolaj Thams, Jonas Peters et al.

We propose a method for learning linear models whose predictive performance is robust to causal interventions on unobserved variables, when noisy proxies of those variables are available. Our approach takes the form of a regularization term that trades off between in-distribution performance and robustness to interventions. Under the assumption of a linear structural causal model, we show that a single proxy can be used to create estimators that are prediction optimal under interventions of bounded strength. This strength depends on the magnitude of the measurement noise in the proxy, which is, in general, not identifiable. In the case of two proxy variables, we propose a modified estimator that is prediction optimal under interventions up to a known strength. We further show how to extend these estimators to scenarios where additional information about the "test time" intervention is available during training. We evaluate our theoretical findings in synthetic experiments and using real data of hourly pollution levels across several cities in China.

EMMay 7, 2020
Distributional robustness of K-class estimators and the PULSE

Martin Emil Jakobsen, Jonas Peters

While causal models are robust in that they are prediction optimal under arbitrarily strong interventions, they may not be optimal when the interventions are bounded. We prove that the classical K-class estimator satisfies such optimality by establishing a connection between K-class estimators and anchor regression. This connection further motivates a novel estimator in instrumental variable settings that minimizes the mean squared prediction error subject to the constraint that the estimator lies in an asymptotically valid confidence region of the causal coefficient. We call this estimator PULSE (p-uncorrelated least squares estimator), relate it to work on invariance, show that it can be computed efficiently as a data-driven K-class estimator, even though the underlying optimization problem is non-convex, and prove consistency. We evaluate the estimators on real data and perform simulation experiments illustrating that PULSE suffers from less variability. There are several settings including weak instrument settings, where it outperforms other estimators.

MLOct 28, 2018
Learning stable and predictive structures in kinetic systems: Benefits of a causal approach

Niklas Pfister, Stefan Bauer, Jonas Peters

Learning kinetic systems from data is one of the core challenges in many fields. Identifying stable models is essential for the generalization capabilities of data-driven inference. We introduce a computationally efficient framework, called CausalKinetiX, that identifies structure from discrete time, noisy observations, generated from heterogeneous experiments. The algorithm assumes the existence of an underlying, invariant kinetic model, a key criterion for reproducible research. Results on both simulated and real-world examples suggest that learning the structure of kinetic systems benefits from a causal perspective. The identified variables and models allow for a concise description of the dynamics across multiple experimental settings and can be used for prediction in unseen experiments. We observe significant improvements compared to well established approaches focusing solely on predictive performance, especially for out-of-sample generalization.

MENov 18, 2016
Foundations of Structural Causal Models with Cycles and Latent Variables

Stephan Bongers, Patrick Forré, Jonas Peters et al.

Structural causal models (SCMs), also known as (nonparametric) structural equation models (SEMs), are widely used for causal modeling purposes. In particular, acyclic SCMs, also known as recursive SEMs, form a well-studied subclass of SCMs that generalize causal Bayesian networks to allow for latent confounders. In this paper, we investigate SCMs in a more general setting, allowing for the presence of both latent confounders and cycles. We show that in the presence of cycles, many of the convenient properties of acyclic SCMs do not hold in general: they do not always have a solution; they do not always induce unique observational, interventional and counterfactual distributions; a marginalization does not always exist, and if it exists the marginal model does not always respect the latent projection; they do not always satisfy a Markov property; and their graphs are not always consistent with their causal semantics. We prove that for SCMs in general each of these properties does hold under certain solvability conditions. Our work generalizes results for SCMs with cycles that were only known for certain special cases so far. We introduce the class of simple SCMs that extends the class of acyclic SCMs to the cyclic setting, while preserving many of the convenient properties of acyclic SCMs. With this paper we aim to provide the foundations for a general theory of statistical causal modeling with SCMs.

STMar 1, 2016
Kernel-based Tests for Joint Independence

Niklas Pfister, Peter Bühlmann, Bernhard Schölkopf et al.

We investigate the problem of testing whether $d$ random variables, which may or may not be continuous, are jointly (or mutually) independent. Our method builds on ideas of the two variable Hilbert-Schmidt independence criterion (HSIC) but allows for an arbitrary number of variables. We embed the $d$-dimensional joint distribution and the product of the marginals into a reproducing kernel Hilbert space and define the $d$-variable Hilbert-Schmidt independence criterion (dHSIC) as the squared distance between the embeddings. In the population case, the value of dHSIC is zero if and only if the $d$ variables are jointly independent, as long as the kernel is characteristic. Based on an empirical estimate of dHSIC, we define three different non-parametric hypothesis tests: a permutation test, a bootstrap test and a test based on a Gamma approximation. We prove that the permutation test achieves the significance level and that the bootstrap test achieves pointwise asymptotic significance level as well as pointwise asymptotic consistency (i.e., it is able to detect any type of fixed dependence in the large sample limit). The Gamma approximation does not come with these guarantees; however, it is computationally very fast and for small $d$, it performs well in practice. Finally, we apply the test to a problem in causal discovery.

MLJul 19, 2015
Invariant Models for Causal Transfer Learning

Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner et al.

Methods of transfer learning try to combine knowledge from several related tasks (or domains) to improve performance on a test task. Inspired by causal methodology, we relax the usual covariate shift assumption and assume that it holds true for a subset of predictor variables: the conditional distribution of the target variable given this subset of predictors is invariant over all tasks. We show how this assumption can be motivated from ideas in the field of causality. We focus on the problem of Domain Generalization, in which no examples from the test task are observed. We prove that in an adversarial setting using this subset for prediction is optimal in Domain Generalization; we further provide examples, in which the tasks are sufficiently diverse and the estimator therefore outperforms pooling the data, even on average. If examples from the test task are available, we also provide a method to transfer knowledge from the training tasks and exploit all available features for prediction. However, we provide no guarantees for this method. We introduce a practical method which allows for automatic inference of the above subset and provide corresponding code. We present results on synthetic data sets and a gene deletion data set.

MEJun 8, 2015
backShift: Learning causal cyclic graphs from unknown shift interventions

Dominik Rothenhäusler, Christina Heinze, Jonas Peters et al.

We propose a simple method to learn linear causal cyclic models in the presence of latent variables. The method relies on equilibrium data of the model recorded under a specific kind of interventions ("shift interventions"). The location and strength of these interventions do not have to be known and can be estimated from the data. Our method, called backShift, only uses second moments of the data and performs simple joint matrix diagonalization, applied to differences between covariance matrices. We give a sufficient and necessary condition for identifiability of the system, which is fulfilled almost surely under some quite general assumptions if and only if there are at least three distinct experimental settings, one of which can be pure observational data. We demonstrate the performance on some simulated data and applications in flow cytometry and financial time series. The code is made available as R-package backShift.

MLMay 12, 2015
Removing systematic errors for exoplanet search via latent causes

Bernhard Schölkopf, David W. Hogg, Dun Wang et al.

We describe a method for removing the effect of confounders in order to reconstruct a latent quantity of interest. The method, referred to as half-sibling regression, is inspired by recent work in causal inference using additive noise models. We provide a theoretical justification and illustrate the potential of the method in a challenging astronomy application.

MLJan 27, 2015
Computing Functions of Random Variables via Reproducing Kernel Hilbert Space Representations

Bernhard Schölkopf, Krikamol Muandet, Kenji Fukumizu et al.

We describe a method to perform functional operations on probability distributions of random variables. The method uses reproducing kernel Hilbert space representations of probability distributions, and it is applicable to all operations which can be applied to points drawn from the respective distributions. We refer to our approach as {\em kernel probabilistic programming}. We illustrate it on synthetic data, and show how it can be used for nonparametric structural equation models, with an application to causal inference.

MEJan 6, 2015
Causal inference using invariant prediction: identification and confidence intervals

Jonas Peters, Peter Bühlmann, Nicolai Meinshausen

What is the difference of a prediction that is made with a causal model and a non-causal model? Suppose we intervene on the predictor variables or change the whole environment. The predictions from a causal model will in general work as well under interventions as for observational data. In contrast, predictions from a non-causal model can potentially be very wrong if we actively intervene on variables. Here, we propose to exploit this invariance of a prediction under a causal model for causal inference: given different experimental settings (for example various interventions) we collect all models that do show invariance in their predictive accuracy across settings and interventions. The causal model will be a member of this set of models with high probability. This approach yields valid confidence intervals for the causal relationships in quite general scenarios. We examine the example of structural equation models in more detail and provide sufficient assumptions under which the set of causal predictors becomes identifiable. We further investigate robustness properties of our approach under model misspecification and discuss possible extensions. The empirical properties are studied for various data sets, including large-scale gene perturbation experiments.

LGDec 11, 2014
Distinguishing cause from effect using observational data: methods and benchmarks

Joris M. Mooij, Jonas Peters, Dominik Janzing et al.

The discovery of causal relationships from purely observational data is a fundamental problem in science. The most elementary form of such a causal discovery problem is to decide whether X causes Y or, alternatively, Y causes X, given joint observations of two variables X, Y. An example is to decide whether altitude causes temperature, or vice versa, given only joint measurements of both variables. Even under the simplifying assumptions of no confounding, no feedback loops, and no selection bias, such bivariate causal discovery problems are challenging. Nevertheless, several approaches for addressing those problems have been proposed in recent years. We review two families of such methods: Additive Noise Methods (ANM) and Information Geometric Causal Inference (IGCI). We present the benchmark CauseEffectPairs that consists of data for 100 different cause-effect pairs selected from 37 datasets from various domains (e.g., meteorology, biology, medicine, engineering, economy, etc.) and motivate our decisions regarding the "ground truth" causal directions of all pairs. We evaluate the performance of several bivariate causal discovery methods on these real-world benchmark data and in addition on artificially simulated data. Our empirical results on real-world data indicate that certain methods are indeed able to distinguish cause from effect using only purely observational data, although more benchmark data would be needed to obtain statistically significant conclusions. One of the best performing methods overall is the additive-noise method originally proposed by Hoyer et al. (2009), which obtains an accuracy of 63+-10 % and an AUC of 0.74+-0.05 on the real-world benchmark. As the main theoretical contribution of this work we prove the consistency of that method.

PRMar 3, 2014
On the Intersection Property of Conditional Independence and its Application to Causal Discovery

Jonas Peters

This work investigates the intersection property of conditional independence. It states that for random variables $A,B,C$ and $X$ we have that $X$ independent of $A$ given $B,C$ and $X$ independent of $B$ given $A,C$ implies $X$ independent of $(A,B)$ given $C$. Under the assumption that the joint distribution has a continuous density, we provide necessary and sufficient conditions under which the intersection property holds. The result has direct applications to causal inference: it leads to strictly weaker conditions under which the graphical structure becomes identifiable from the joint distribution of an additive noise model.

MEOct 6, 2013
CAM: Causal additive models, high-dimensional order search and penalized regression

Peter Bühlmann, Jonas Peters, Jan Ernest

We develop estimation for potentially high-dimensional additive structural equation models. A key component of our approach is to decouple order search among the variables from feature or edge selection in a directed acyclic graph encoding the causal structure. We show that the former can be done with nonregularized (restricted) maximum likelihood estimation while the latter can be efficiently addressed using sparse regression techniques. Thus, we substantially simplify the problem of structure search and estimation for an important class of causal models. We establish consistency of the (restricted) maximum likelihood estimator for low- and high-dimensional scenarios, and we also allow for misspecification of the error distribution. Furthermore, we develop an efficient computational algorithm which can deal with many variables, and the new method's accuracy and performance is illustrated on simulated and real data.

LGSep 26, 2013
Identifying Finite Mixtures of Nonparametric Product Distributions and Causal Inference of Confounders

Eleni Sgouritsa, Dominik Janzing, Jonas Peters et al.

We propose a kernel method to identify finite mixtures of nonparametric product distributions. It is based on a Hilbert space embedding of the joint distribution. The rank of the constructed tensor is equal to the number of mixture components. We present an algorithm to recover the components by partitioning the data points into clusters such that the variables are jointly conditionally independent given the cluster. This method can be used to identify finite confounders.

MLSep 26, 2013
Causal Discovery with Continuous Additive Noise Models

Jonas Peters, Joris Mooij, Dominik Janzing et al.

We consider the problem of learning causal directed acyclic graphs from an observational joint distribution. One can use these graphs to predict the outcome of interventional experiments, from which data are often not available. We show that if the observational distribution follows a structural equation model with an additive noise structure, the directed acyclic graph becomes identifiable from the distribution under mild conditions. This constitutes an interesting alternative to traditional methods that assume faithfulness and identify only the Markov equivalence class of the graph, thus leaving some edges undirected. We provide practical algorithms for finitely many samples, RESIT (Regression with Subsequent Independence Test) and two methods based on an independence score. We prove that RESIT is correct in the population setting and provide an empirical evaluation.

LGSep 11, 2012
Counterfactual Reasoning and Learning Systems

Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela et al.

This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select changes that improve both the short-term and long-term performance of such systems. This work is illustrated by experiments carried out on the ad placement system associated with the Bing search engine.

MLJul 21, 2012
Causal Inference on Time Series using Structural Equation Models

Jonas Peters, Dominik Janzing, Bernhard Schölkopf

Causal inference uses observations to infer the causal structure of the data generating system. We study a class of functional models that we call Time Series Models with Independent Noise (TiMINo). These models require independent residual time series, whereas traditional methods like Granger causality exploit the variance of residuals. There are two main contributions: (1) Theoretical: By restricting the model class (e.g. to additive noise) we can provide a more general identifiability result than existing ones. This result incorporates lagged and instantaneous effects that can be nonlinear and do not need to be faithful, and non-instantaneous feedbacks between the time series. (2) Practical: If there are no feedback loops between time series, we propose an algorithm based on non-linear independence tests of time series. When the data are causally insufficient, or the data generating process does not satisfy the model assumptions, this algorithm may still give partial results, but mostly avoids incorrect answers. An extension to (non-instantaneous) feedbacks is possible, but not discussed. It outperforms existing methods on artificial and real data. Code can be provided upon request.

LGJun 27, 2012
On Causal and Anticausal Learning

Bernhard Schoelkopf, Dominik Janzing, Jonas Peters et al.

We consider the problem of function estimation in the case where an underlying causal model can be inferred. This has implications for popular scenarios such as covariate shift, concept drift, transfer learning and semi-supervised learning. We argue that causal knowledge may facilitate some approaches for a given problem, and rule out others. In particular, we formulate a hypothesis for when semi-supervised learning can help, and corroborate it with empirical results.

MLMay 11, 2012
Identifiability of Gaussian structural equation models with equal error variances

Jonas Peters, Peter Bühlmann

We consider structural equation models in which variables can be written as a function of their parents and noise terms, which are assumed to be jointly independent. Corresponding to each structural equation model, there is a directed acyclic graph describing the relationships between the variables. In Gaussian structural equation models with linear functions, the graph can be identified from the joint distribution only up to Markov equivalence classes, assuming faithfulness. In this work, we prove full identifiability if all noise variables have the same variances: the directed acyclic graph can be recovered from the joint Gaussian distribution. Our result has direct implications for causal inference: if the data follow a Gaussian structural equation model with equal error variances and assuming that all variables are observed, the causal structure can be inferred from observational data only. We propose a statistical method and an algorithm that exploit our theoretical findings.

MLMay 9, 2012
Identifying confounders using additive noise models

Dominik Janzing, Jonas Peters, Joris Mooij et al.

We propose a method for inferring the existence of a latent common cause ('confounder') of two observed random variables. The method assumes that the two effects of the confounder are (possibly nonlinear) functions of the confounder plus independent, additive noise. We discuss under which conditions the model is identifiable (up to an arbitrary reparameterization of the confounder) from the joint distribution of the effects. We state and prove a theoretical result that provides evidence for the conjecture that the model is generically identifiable under suitable technical conditions. In addition, we propose a practical method to estimate the confounder from a finite i.i.d. sample of the effects and illustrate that the method works well on both simulated and real-world data.

LGFeb 14, 2012
Detecting low-complexity unobserved causes

Dominik Janzing, Eleni Sgouritsa, Oliver Stegle et al.

We describe a method that infers whether statistical dependences between two observed variables X and Y are due to a "direct" causal link or only due to a connecting causal path that contains an unobserved variable of low complexity, e.g., a binary variable. This problem is motivated by statistical genetics. Given a genetic marker that is correlated with a phenotype of interest, we want to detect whether this marker is causal or it only correlates with a causal one. Our method is based on the analysis of the location of the conditional distributions P(Y|x) in the simplex of all distributions of Y. We report encouraging results on semi-empirical data.

LGFeb 14, 2012
Kernel-based Conditional Independence Test and Application in Causal Discovery

Kun Zhang, Jonas Peters, Dominik Janzing et al.

Conditional independence testing is an important problem, especially in Bayesian network learning and causal discovery. Due to the curse of dimensionality, testing for conditional independence of continuous variables is particularly challenging. We propose a Kernel-based Conditional Independence test (KCI-test), by constructing an appropriate test statistic and deriving its asymptotic distribution under the null hypothesis of conditional independence. The proposed method is computationally efficient and easy to implement. Experimental results show that it outperforms other methods, especially when the conditioning set is large or the sample size is not very large, in which case other methods encounter difficulties.

LGFeb 14, 2012
Identifiability of Causal Graphs using Functional Models

Jonas Peters, Joris Mooij, Dominik Janzing et al.

This work addresses the following question: Under what assumptions on the data generating process can one infer the causal graph from the joint distribution? The approach taken by conditional independence-based causal discovery methods is based on two assumptions: the Markov condition and faithfulness. It has been shown that under these assumptions the causal graph can be identified up to Markov equivalence (some arrows remain undirected) using methods like the PC algorithm. In this work we propose an alternative by defining Identifiable Functional Model Classes (IFMOCs). As our main theorem we prove that if the data generating process belongs to an IFMOC, one can identify the complete causal graph. To the best of our knowledge this is the first identifiability result of this kind that is not limited to linear functional relationships. We discuss how the IFMOC assumption and the Markov and faithfulness assumptions relate to each other and explain why we believe that the IFMOC assumption can be tested more easily on given data. We further provide a practical algorithm that recovers the causal graph from finitely many data; experiments on simulated data support the theoretical findings.