A Provably Efficient Algorithm for Linear Markov Decision Process with Low Switching Cost
This addresses the need for efficient algorithms in applications like medical domains and recommendation systems where frequent policy changes are costly, offering a provable improvement over prior work.
The paper tackles the problem of reinforcement learning in linear Markov Decision Processes with a constraint on the number of policy changes, presenting an algorithm that achieves an $\widetilde{O}\left(\sqrt{d^3H^4K} ight)$ regret bound with a near-optimal $O\left(d H\log K ight)$ global switching cost, matching the best existing regret while exponentially reducing switching costs.
Many real-world applications, such as those in medical domains, recommendation systems, etc, can be formulated as large state space reinforcement learning problems with only a small budget of the number of policy changes, i.e., low switching cost. This paper focuses on the linear Markov Decision Process (MDP) recently studied in [Yang et al 2019, Jin et al 2020] where the linear function approximation is used for generalization on the large state space. We present the first algorithm for linear MDP with a low switching cost. Our algorithm achieves an $\widetilde{O}\left(\sqrt{d^3H^4K}\right)$ regret bound with a near-optimal $O\left(d H\log K\right)$ global switching cost where $d$ is the feature dimension, $H$ is the planning horizon and $K$ is the number of episodes the agent plays. Our regret bound matches the best existing polynomial algorithm by [Jin et al 2020] and our switching cost is exponentially smaller than theirs. When specialized to tabular MDP, our switching cost bound improves those in [Bai et al 2019, Zhang et al 20020]. We complement our positive result with an $Ω\left(dH/\log d\right)$ global switching cost lower bound for any no-regret algorithm.