LG AI MLDec 20, 2023

Robustly Improving Bandit Algorithms with Confounded and Selection Biased Offline Data: A Causal Approach

arXiv:2312.12731v15.32 citationsh-index: 5AAAI

Originality Incremental advance

AI Analysis

This addresses the challenge of leveraging biased offline data for bandit learning, which is crucial for applications like recommendation systems and clinical trials, though it is incremental as it builds on existing causal methods.

The paper tackles the problem of improving bandit algorithms using offline data that suffers from confounding and selection biases, by deriving causal bounds that contain the true mean reward and guide the agent to a nearly-optimal policy, showing consistent reduction in asymptotic regret in contextual and non-contextual settings.

This paper studies bandit problems where an agent has access to offline data that might be utilized to potentially improve the estimation of each arm's reward distribution. A major obstacle in this setting is the existence of compound biases from the observational data. Ignoring these biases and blindly fitting a model with the biased data could even negatively affect the online learning phase. In this work, we formulate this problem from a causal perspective. First, we categorize the biases into confounding bias and selection bias based on the causal structure they imply. Next, we extract the causal bound for each arm that is robust towards compound biases from biased observational data. The derived bounds contain the ground truth mean reward and can effectively guide the bandit agent to learn a nearly-optimal decision policy. We also conduct regret analysis in both contextual and non-contextual bandit settings and show that prior causal bounds could help consistently reduce the asymptotic regret.

View on arXiv PDF

Similar