LG May 25, 2022

Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary Environments

arXiv:2205.13044

Originality: Highly original

AI Analysis

This work addresses reinforcement learning in changing environments. It builds on prior techniques such as finite-horizon approximation and adaptive confidence widening.

The paper studies goal-oriented reinforcement learning in non-stationary environments. It establishes a lower bound on dynamic regret and develops algorithms with near-optimal regret: one that is minimax optimal up to logarithmic terms when the amounts of change are known, and a variant that achieves $\widetilde{O}(\min\{B_{\star} S\sqrt{ALK}, (B_{\star}^2S^2AT_{\star}(\Delta_c+B_{\star}\Delta_P))^{1/3}K^{2/3}\})$ regret when they are unknown.

We initiate the study of dynamic regret minimization for goal-oriented reinforcement learning modeled by a non-stationary stochastic shortest path problem with changing cost and transition functions. We start by establishing a lower bound $\Omega((B_{\star} SAT_{\star}(\Delta_c + B_{\star}^2\Delta_P))^{1/3}K^{2/3})$, where $B_{\star}$ is the maximum expected cost of the optimal policy of any episode starting from any state, $T_{\star}$ is the maximum hitting time of the optimal policy of any episode starting from the initial state, $SA$ is the number of state-action pairs, $\Delta_c$ and $\Delta_P$ are the amount of changes of the cost and transition functions respectively, and $K$ is the number of episodes. The different roles of $\Delta_c$ and $\Delta_P$ in this lower bound inspire us to design algorithms that estimate costs and transitions separately. Specifically, assuming the knowledge of $\Delta_c$ and $\Delta_P$, we develop a simple but sub-optimal algorithm and another more involved minimax optimal algorithm (up to logarithmic terms). These algorithms combine the ideas of finite-horizon approximation [Chen et al., 2022a], special Bernstein-style bonuses of the MVP algorithm [Zhang et al., 2020], adaptive confidence widening [Wei and Luo, 2021], as well as some new techniques such as properly penalizing long-horizon policies. Finally, when $\Delta_c$ and $\Delta_P$ are unknown, we develop a variant of the MASTER algorithm [Wei and Luo, 2021] and integrate the aforementioned ideas into it to achieve $\widetilde{O}(\min\{B_{\star} S\sqrt{ALK}, (B_{\star}^2S^2AT_{\star}(\Delta_c+B_{\star}\Delta_P))^{1/3}K^{2/3}\})$ regret, where $L$ is the unknown number of changes of the environment.
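To get a feel for how the two regret expressions in the abstract scale, the following minimal Python sketch evaluates them numerically. It is only an illustration: the function names are hypothetical, all constants and logarithmic factors hidden by $\Omega$ and $\widetilde{O}$ are dropped, and nothing here comes from the paper's algorithms themselves.

```python
import math


def lower_bound(B, S, A, T, delta_c, delta_p, K):
    """Lower-bound expression from the abstract, constants dropped:
    (B * S * A * T * (delta_c + B^2 * delta_p))^(1/3) * K^(2/3)."""
    return (B * S * A * T * (delta_c + B**2 * delta_p)) ** (1 / 3) * K ** (2 / 3)


def upper_bound_terms(B, S, A, T, delta_c, delta_p, K, L):
    """The two terms inside the min of the upper bound for unknown changes,
    with constants and log factors dropped (hypothetical helper)."""
    # Term driven by the number of changes L of the environment.
    switching = B * S * math.sqrt(A * L * K)
    # Term driven by the total variation of costs and transitions.
    variation = (
        B**2 * S**2 * A * T * (delta_c + B * delta_p)
    ) ** (1 / 3) * K ** (2 / 3)
    return switching, variation


def upper_bound(B, S, A, T, delta_c, delta_p, K, L):
    """Minimum of the two terms, as in the abstract's upper bound."""
    return min(upper_bound_terms(B, S, A, T, delta_c, delta_p, K, L))


if __name__ == "__main__":
    # Arbitrary illustrative values; they are assumptions, not from the paper.
    params = dict(B=2, S=3, A=4, T=5, delta_c=1.0, delta_p=1.0, K=1000)
    for L in (1, 100):
        switching, variation = upper_bound_terms(L=L, **params)
        active = "switching" if switching <= variation else "variation"
        print(f"L={L}: switching={switching:.1f}, "
              f"variation={variation:.1f}, active term: {active}")
```

Because constants and logarithmic factors are ignored, the absolute numbers carry no meaning. The sketch is useful only for seeing which term of the minimum is active: the $\sqrt{ALK}$ term for environments that change few times, and the $K^{2/3}$ term when $L$ is large but the total variation $\Delta_c + B_{\star}\Delta_P$ stays small.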
