ML LG | Oct 26, 2022

Optimizing Pessimism in Dynamic Treatment Regimes: A Bayesian Learning Approach

Yunzhe Zhou, Zhengling Qi, Chengchun Shi, Lexin Li

arXiv:2210.14420v2 | 12.410 citations | h-index: 33 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses a key limitation of offline reinforcement learning for dynamic treatment regimes: sensitivity to the hyper-parameter that sets the degree of pessimism. It offers a more robust, tuning-free method for healthcare or policy applications, though it is an incremental improvement over existing pessimism-based approaches.

The authors address the sub-optimal policies that arise in offline dynamic treatment regimes when the coverage condition fails. They propose a Bayesian learning method that optimizes the degree of pessimism without hyper-parameter tuning, and report that it outperforms state-of-the-art solutions in simulations and on a real data example.

In this article, we propose a novel pessimism-based Bayesian learning method for optimal dynamic treatment regimes in the offline setting. When the coverage condition does not hold, which is common for offline data, the existing solutions would produce sub-optimal policies. The pessimism principle addresses this issue by discouraging recommendation of actions that are less explored conditioning on the state. However, nearly all pessimism-based methods rely on a key hyper-parameter that quantifies the degree of pessimism, and the performance of the methods can be highly sensitive to the choice of this parameter. We propose to integrate the pessimism principle with Thompson sampling and Bayesian machine learning for optimizing the degree of pessimism. We derive a credible set whose boundary uniformly lower bounds the optimal Q-function, and thus we do not require additional tuning of the degree of pessimism. We develop a general Bayesian learning method that works with a range of models, from Bayesian linear basis model to Bayesian neural network model. We develop the computational algorithm based on variational inference, which is highly efficient and scalable. We establish the theoretical guarantees of the proposed method, and show empirically that it outperforms the existing state-of-the-art solutions through both simulations and a real data example.
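
To make the pessimism principle concrete, here is a minimal single-stage sketch, not the authors' algorithm. It assumes a conjugate Bayesian linear model fitted separately per action, a known noise variance, and a fixed lower posterior quantile (`level`, a hypothetical parameter) in place of the credible set derived in the paper. All function names and the toy data are illustrative. The sketch compares a greedy choice on the posterior mean with a choice based on a lower posterior bound of Q(x, a) estimated from posterior draws.

```python
import numpy as np

def fit_posterior(X, y, prior_var=10.0, noise_var=4.0):
    """Gaussian posterior (mean, cov) of the weights in Bayesian linear regression."""
    d = X.shape[1]
    precision = np.eye(d) / prior_var + X.T @ X / noise_var
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / noise_var
    return mean, cov

def lower_bounds(x, posteriors, rng, n_draws=2000, level=0.05):
    """Lower posterior quantile of Q(x, a) for each action, from posterior draws."""
    bounds = []
    for mean, cov in posteriors:
        w = rng.multivariate_normal(mean, cov, size=n_draws)  # draws of the weights
        bounds.append(np.quantile(w @ x, level))              # pessimistic value
    return np.array(bounds)

rng = np.random.default_rng(0)

# Toy offline data with an intercept-only feature (assumed for illustration).
# Action 0 is well explored; action 1 was tried only twice.
X0, y0 = np.ones((200, 1)), np.full(200, 1.0)
X1, y1 = np.ones((2, 1)), np.array([3.0, 2.0])
posteriors = [fit_posterior(X0, y0), fit_posterior(X1, y1)]

x = np.array([1.0])
greedy = int(np.argmax([mean @ x for mean, _ in posteriors]))
lower = lower_bounds(x, posteriors, rng)
pessimistic = int(np.argmax(lower))

print("greedy action:", greedy)
print("pessimistic action:", pessimistic)
```

In this toy setting the greedy rule favors the rarely tried action because its two observed rewards happen to be high, while the lower-bound rule prefers the well-explored action whose value is known with more certainty. The fixed `level` is exactly the kind of tuning parameter the abstract says the proposed method avoids.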
