LG AI MLSep 30, 2024

Upper and Lower Bounds for Distributionally Robust Off-Dynamics Reinforcement Learning

arXiv:2409.20521v117.615 citationsh-index: 6

Originality Highly original

AI Analysis

This work provides a near-optimal and computationally efficient algorithm for robust policy learning in dynamic environments, which is crucial for real-world RL applications where environmental shifts are common.

This paper addresses off-dynamics Reinforcement Learning (RL) where training and deployment environments differ, focusing on distributionally robust Markov decision processes (DRMDPs). The proposed We-DRIVE-U algorithm achieves an average suboptimality of $\widetilde{\mathcal{O}}(dH \cdot \min \{1/\rho, H\}/\sqrt{K})$, improving the state-of-the-art by $\mathcal{O}(dH/\min\{1/\rho,H\})$.

We study off-dynamics Reinforcement Learning (RL), where the policy training and deployment environments are different. To deal with this environmental perturbation, we focus on learning policies robust to uncertainties in transition dynamics under the framework of distributionally robust Markov decision processes (DRMDPs), where the nominal and perturbed dynamics are linear Markov Decision Processes. We propose a novel algorithm We-DRIVE-U that enjoys an average suboptimality $\widetilde{\mathcal{O}}\big({d H \cdot \min \{1/ρ, H\}/\sqrt{K} }\big)$, where $K$ is the number of episodes, $H$ is the horizon length, $d$ is the feature dimension and $ρ$ is the uncertainty level. This result improves the state-of-the-art by $\mathcal{O}(dH/\min\{1/ρ,H\})$. We also construct a novel hard instance and derive the first information-theoretic lower bound in this setting, which indicates our algorithm is near-optimal up to $\mathcal{O}(\sqrt{H})$ for any uncertainty level $ρ\in(0,1]$. Our algorithm also enjoys a 'rare-switching' design, and thus only requires $\mathcal{O}(dH\log(1+H^2K))$ policy switches and $\mathcal{O}(d^2H\log(1+H^2K))$ calls for oracle to solve dual optimization problems, which significantly improves the computational efficiency of existing algorithms for DRMDPs, whose policy switch and oracle complexities are both $\mathcal{O}(K)$.

View on arXiv PDF

Similar