LG AI MLJan 2, 2021

A Provably Efficient Algorithm for Linear Markov Decision Process with Low Switching Cost

Minbo Gao, Tianle Xie, Simon S. Du, Lin F. Yang

arXiv:2101.00494v123.251 citations

Originality Highly original

AI Analysis

This addresses the need for efficient algorithms in applications like medical domains and recommendation systems where frequent policy changes are costly, offering a provable improvement over prior work.

The paper tackles the problem of reinforcement learning in linear Markov Decision Processes with a constraint on the number of policy changes, presenting an algorithm that achieves an $\widetilde{O}\left(\sqrt{d^3H^4K} ight)$ regret bound with a near-optimal $O\left(d H\log K ight)$ global switching cost, matching the best existing regret while exponentially reducing switching costs.

Many real-world applications, such as those in medical domains, recommendation systems, etc, can be formulated as large state space reinforcement learning problems with only a small budget of the number of policy changes, i.e., low switching cost. This paper focuses on the linear Markov Decision Process (MDP) recently studied in [Yang et al 2019, Jin et al 2020] where the linear function approximation is used for generalization on the large state space. We present the first algorithm for linear MDP with a low switching cost. Our algorithm achieves an $\widetilde{O}\left(\sqrt{d^3H^4K}\right)$ regret bound with a near-optimal $O\left(d H\log K\right)$ global switching cost where $d$ is the feature dimension, $H$ is the planning horizon and $K$ is the number of episodes the agent plays. Our regret bound matches the best existing polynomial algorithm by [Jin et al 2020] and our switching cost is exponentially smaller than theirs. When specialized to tabular MDP, our switching cost bound improves those in [Bai et al 2019, Zhang et al 20020]. We complement our positive result with an $Ω\left(dH/\log d\right)$ global switching cost lower bound for any no-regret algorithm.

View on arXiv PDF

Similar