LG MLMay 8, 2024

Model-Free Robust $φ$-Divergence Reinforcement Learning Using Both Offline and Online Data

Kishan Panaganti, Adam Wierman, Eric Mazumdar

arXiv:2405.05468v114.23 citationsh-index: 10

Originality Incremental advance

AI Analysis

This work addresses robustness in reinforcement learning for applications where simulators differ from real-world settings, offering incremental improvements in handling data distribution assumptions.

The paper tackles robust reinforcement learning under model uncertainty by proposing model-free algorithms for learning optimal robust policies using offline and hybrid data, achieving theoretical guarantees for high-dimensional systems with general function approximation.

The robust $φ$-regularized Markov Decision Process (RRMDP) framework focuses on designing control policies that are robust against parameter uncertainties due to mismatches between the simulator (nominal) model and real-world settings. This work makes two important contributions. First, we propose a model-free algorithm called Robust $φ$-regularized fitted Q-iteration (RPQ) for learning an $ε$-optimal robust policy that uses only the historical data collected by rolling out a behavior policy (with robust exploratory requirement) on the nominal model. To the best of our knowledge, we provide the first unified analysis for a class of $φ$-divergences achieving robust optimal policies in high-dimensional systems with general function approximation. Second, we introduce the hybrid robust $φ$-regularized reinforcement learning framework to learn an optimal robust policy using both historical data and online sampling. Towards this framework, we propose a model-free algorithm called Hybrid robust Total-variation-regularized Q-iteration (HyTQ: pronounced height-Q). To the best of our knowledge, we provide the first improved out-of-data-distribution assumption in large-scale problems with general function approximation under the hybrid robust $φ$-regularized reinforcement learning framework. Finally, we provide theoretical guarantees on the performance of the learned policies of our algorithms on systems with arbitrary large state space.

View on arXiv PDF

Similar