LG ML · Oct 18, 2024

Harnessing Causality in Reinforcement Learning With Bagged Decision Times

Daiqi Gao, Hsin-Yu Lai, Predrag Klasnja, Susan A. Murphy

arXiv:2410.14659v3 · 11.57 citations · h-index: 63

Originality: Incremental advance

AI Analysis

The paper addresses RL challenges in domains such as mobile health, where decisions are grouped into bags and the reward is observed only at the end of each bag. It offers a method for handling non-Markovian, non-stationary dynamics within a bag. The contribution appears incremental because it extends existing MDP frameworks with causal structure.

The paper tackles reinforcement learning with bagged decision times, where all actions within a bag jointly affect a single reward. It develops an online RL algorithm that uses an expert-provided causal DAG to construct Markov states and formulates the problem as a periodic MDP. The authors show that the constructed state achieves the maximal optimal value function among all state constructions for a periodic MDP, and they evaluate the method on testbed variants built from real mobile health clinical trial data.

We consider reinforcement learning (RL) for a class of problems with bagged decision times. A bag contains a finite sequence of consecutive decision times. The transition dynamics are non-Markovian and non-stationary within a bag. All actions within a bag jointly impact a single reward, observed at the end of the bag. For example, in mobile health, multiple activity suggestions in a day collectively affect a user's daily commitment to being active. Our goal is to develop an online RL algorithm to maximize the discounted sum of the bag-specific rewards. To handle non-Markovian transitions within a bag, we utilize an expert-provided causal directed acyclic graph (DAG). Based on the DAG, we construct states as a dynamical Bayesian sufficient statistic of the observed history, which results in Markov state transitions within and across bags. We then formulate this problem as a periodic Markov decision process (MDP) that allows non-stationarity within a period. An online RL algorithm based on Bellman equations for stationary MDPs is generalized to handle periodic MDPs. We show that our constructed state achieves the maximal optimal value function among all state constructions for a periodic MDP. Finally, we evaluate the proposed method on testbed variants built from real data in a mobile health clinical trial.
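To make the periodic-MDP formulation concrete, here is a minimal tabular sketch of Bellman backups when the dynamics differ across the K decision times of a bag and a single reward arrives at the end of the bag. It is not the authors' online algorithm: it assumes known transition matrices, a finite state space of the same size at every decision time, and a discount applied once per bag. The function name `periodic_value_iteration` and the arguments `P`, `R_end`, and `gamma` are hypothetical and chosen for illustration only.

```python
import numpy as np


def periodic_value_iteration(P, R_end, gamma, tol=1e-10, max_sweeps=10_000):
    """Value iteration for a periodic MDP with period K (illustrative sketch).

    P      : list of K arrays, each of shape (S, A, S); P[k][s, a, s2] is the
             assumed-known transition probability at decision time k in a bag.
    R_end  : array of shape (S, A); expected bag reward given the state and
             action at the last decision time of the bag (an assumption).
    gamma  : discount factor in [0, 1), applied once per bag.
    Returns (V, policy) with V of shape (K, S) and policy of shape (K, S).
    """
    K = len(P)
    S, A, _ = P[0].shape
    V = np.zeros((K, S))
    Q = np.zeros((K, S, A))
    for _ in range(max_sweeps):
        V_old = V.copy()
        # Sweep backward through one bag, one decision time at a time.
        for k in reversed(range(K)):
            if k == K - 1:
                # Last decision time: collect the bag reward, then move to
                # the first decision time of the next bag, discounted.
                Q[k] = R_end + gamma * (P[k] @ V[0])
            else:
                # Within a bag: no reward and no discount yet.
                Q[k] = P[k] @ V[k + 1]
            V[k] = Q[k].max(axis=1)
        if np.max(np.abs(V - V_old)) < tol:
            break
    policy = Q.argmax(axis=2)
    return V, policy
```

Each decision time keeps its own value function, so non-stationarity within a bag is allowed while the problem remains stationary from bag to bag. An online method would replace the known `P` and `R_end` with estimates learned from data.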
