LG · ML · May 14, 2019

Counterfactual Off-Policy Evaluation with Gumbel-Max Structural Causal Models

arXiv:1905.05824v3 · 27.8 · 197 citations · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need to debug and review specific episodes in high-risk applications such as healthcare. The contribution is incremental, since it builds on existing off-policy evaluation and structural causal model methods.

The paper tackles the problem of identifying episodes in high-risk settings such as healthcare where a reinforcement learning policy would likely have produced a substantially different outcome than the observed policy. It introduces a counterfactual off-policy evaluation procedure based on Gumbel-Max structural causal models and demonstrates its utility in a synthetic sepsis management environment.
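The core building block can be illustrated on a single categorical draw. The sketch below is a minimal illustration, not the authors' implementation. It assumes the Gumbel-Max construction x = argmax_j (log p_j + g_j) with independent standard Gumbel noise g_j. It infers the noise from an observed outcome by rejection sampling (chosen for simplicity; more efficient exact samplers based on truncated Gumbel distributions exist) and then replays that same noise under the counterfactual probabilities. The function names and parameters (`counterfactual_categorical`, `max_tries`, and so on) are hypothetical.

```python
import math
import random


def sample_gumbel(rng: random.Random) -> float:
    """Draw from a standard Gumbel(0, 1) via the inverse CDF: -log(-log(U))."""
    u = rng.random()
    while u == 0.0:  # random() is in [0, 1); avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max(probs, noise):
    """Gumbel-Max trick: argmax_j (log p_j + g_j) is distributed as Categorical(probs)."""
    return max((j for j, p in enumerate(probs) if p > 0),
               key=lambda j: math.log(probs[j]) + noise[j])


def counterfactual_categorical(p_obs, p_cf, observed, rng, n_samples=1000, max_tries=1_000_000):
    """Hypothetical helper: samples of the outcome under p_cf, given `observed` was drawn from p_obs."""
    samples, tries = [], 0
    while len(samples) < n_samples:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("observed outcome is too unlikely for rejection sampling")
        g = [sample_gumbel(rng) for _ in p_obs]
        # Abduction: keep only noise under which the model reproduces the observation.
        if gumbel_max(p_obs, g) != observed:
            continue
        # Intervention + prediction: reuse the same noise with the counterfactual probabilities.
        samples.append(gumbel_max(p_cf, g))
    return samples


# Usage: a fair coin came up 0; what if the coin had been biased 0.1 / 0.9?
rng = random.Random(0)
draws = counterfactual_categorical([0.5, 0.5], [0.1, 0.9], observed=0, rng=rng)
print(sum(d == 1 for d in draws) / len(draws))
```

For these numbers the probability that the counterfactual outcome flips to 1 works out to 0.8, so the printed fraction should be close to that. One property worth noting about this construction: if the intervention raises the observed outcome's probability ratio p_cf/p_obs at least as much as any other outcome's, the counterfactual outcome always equals the observed one.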

We introduce an off-policy evaluation procedure for highlighting episodes where applying a reinforcement learned (RL) policy is likely to have produced a substantially different outcome than the observed policy. In particular, we introduce a class of structural causal models (SCMs) for generating counterfactual trajectories in finite partially observable Markov Decision Processes (POMDPs). We see this as a useful procedure for off-policy "debugging" in high-risk settings (e.g., healthcare); by decomposing the expected difference in reward between the RL and observed policy into specific episodes, we can identify episodes where the counterfactual difference in reward is most dramatic. This in turn can be used to facilitate review of specific episodes by domain experts. We demonstrate the utility of this procedure with a synthetic environment of sepsis management.
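The per-episode decomposition described above can be sketched for a small, fully observed finite MDP. This is a simplification: the paper works with finite POMDPs, and this is a minimal sketch under stated assumptions, not the authors' code. It assumes one Gumbel noise vector per time step, shared between the observed and the counterfactual transition, inferred by rejection sampling from each observed transition. The RL policy (here a hypothetical deterministic `policy` function) then acts in the counterfactual trajectory, which is truncated to the observed episode's length. Episodes are ranked by mean counterfactual return minus observed return. The toy transition table `P`, the `reward` vector, and all function names are hypothetical.

```python
import math
import random


def sample_gumbel(rng):
    """Standard Gumbel(0, 1) draw via the inverse CDF."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))


def argmax_with_noise(probs, noise):
    """Gumbel-Max transition: argmax_j (log p_j + g_j), skipping zero-probability states."""
    return max((j for j, p in enumerate(probs) if p > 0),
               key=lambda j: math.log(probs[j]) + noise[j])


def posterior_noise(probs, observed, rng, max_tries=1_000_000):
    """Rejection-sample Gumbel noise consistent with the observed next state."""
    for _ in range(max_tries):
        g = [sample_gumbel(rng) for _ in probs]
        if argmax_with_noise(probs, g) == observed:
            return g
    raise RuntimeError("observed transition is too unlikely for rejection sampling")


def counterfactual_return(episode, P, reward, policy, rng):
    """Return of one counterfactual trajectory for an observed episode [(s, a, s_next), ...]."""
    s_cf = episode[0][0]  # both trajectories start from the observed initial state
    total = 0.0
    for s, a, s_next in episode:
        g = posterior_noise(P[s][a], s_next, rng)   # abduction for this time step
        a_cf = policy(s_cf)                          # intervention: act with the RL policy
        s_cf = argmax_with_noise(P[s_cf][a_cf], g)   # prediction, reusing the same noise
        total += reward[s_cf]
    return total


def rank_episodes(episodes, P, reward, policy, rng, n_draws=200):
    """Rows (index, observed return, mean counterfactual return, difference), largest difference first."""
    rows = []
    for i, ep in enumerate(episodes):
        observed = sum(reward[s_next] for _, _, s_next in ep)
        cf = sum(counterfactual_return(ep, P, reward, policy, rng) for _ in range(n_draws)) / n_draws
        rows.append((i, observed, cf, cf - observed))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Hypothetical toy problem: state 0 = sick, 1 = recovered, 2 = dead; action 1 = treat.
P = {
    0: {0: [0.6, 0.1, 0.3], 1: [0.3, 0.6, 0.1]},
    1: {0: [0.0, 1.0, 0.0], 1: [0.0, 1.0, 0.0]},
    2: {0: [0.0, 0.0, 1.0], 1: [0.0, 0.0, 1.0]},
}
reward = [0.0, 1.0, -1.0]
episodes = [[(0, 0, 2)], [(0, 1, 1)]]  # untreated patient died; treated patient recovered
ranking = rank_episodes(episodes, P, reward, policy=lambda s: 1, rng=random.Random(0))
for idx, obs, cf, diff in ranking:
    print(idx, obs, round(cf, 2), round(diff, 2))
```

The second episode already followed the "always treat" policy, so its counterfactual matches the observation and its difference is zero. The first episode moves to the top of the ranking. Sorting by the signed difference surfaces episodes where the RL policy looks most beneficial. Sorting in ascending order instead surfaces episodes where it looks most harmful, which may be the more useful view for expert review.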
