Causal bandits with backdoor adjustment on unknown Gaussian DAGs
This addresses the problem of sequential decision-making under unknown causal structures for researchers in causal inference and bandit algorithms, representing a novel method for a known bottleneck rather than an incremental improvement.
The paper tackles the causal bandit problem where the causal graph structure is unknown, assuming a Gaussian linear DAG, by developing a novel algorithm that uses backdoor adjustment sets from experimental and observational data to estimate causal effects and construct upper confidence bounds. The result includes improved cumulative regret bounds over standard multi-armed bandit algorithms, with empirical advantages in regret and computation time.
The causal bandit problem aims to sequentially learn the intervention that maximizes the expectation of a reward variable within a system governed by a causal graph. Most existing approaches assume prior knowledge of the graph structure, or impose unrealistically restrictive conditions on the graph. In this paper, we assume a Gaussian linear directed acyclic graph (DAG) over arms and the reward variable, and study the causal bandit problem when the graph structure is unknown. We identify backdoor adjustment sets for each arm using sequentially generated experimental and observational data during the decision process, which allows us to estimate causal effects and construct upper confidence bounds. By integrating estimates from both data sources, we develop a novel bandit algorithm, based on modified upper confidence bounds, to sequentially determine the optimal intervention. We establish both case-dependent and case-independent upper bounds on the cumulative regret for our algorithm, which improve upon the bounds of the standard multi-armed bandit algorithms. Our empirical study demonstrates its advantage over existing methods with respect to cumulative regret and computation time.