Robust Transfer Learning with Side Information
This work addresses the overly conservative policies that robust MDPs produce under environmental shift, giving practitioners a method to derive less pessimistic policies.
This paper proposes a framework for robust transfer learning in Markov Decision Processes (MDPs) under environmental shifts. It uses estimate-centered uncertainty sets constructed by integrating limited target samples with side information about source-target dynamics, leading to improved kernel estimates and tighter uncertainty sets. The approach consistently demonstrates superior target-domain performance over state-of-the-art robust and non-robust baselines across various OpenAI Gym environments and classic control problems.
Robust Markov Decision Processes (MDPs) address environmental shift through distributionally robust optimization (DRO) by finding an optimal worst-case policy within an uncertainty set of transition kernels. However, standard DRO approaches require enlarging the uncertainty set under large shifts, which leads to overly conservative and pessimistic policies. In this paper, we propose a framework for transfer under environment shift that derives a robust target-domain policy via estimate-centered uncertainty sets, constructed through constrained estimation that integrates limited target samples with side information about the source-target dynamics. The side information includes bounds on feature moments, distributional distances, and density ratios, yielding improved kernel estimates and tighter uncertainty sets. Error bounds and convergence results are established for both robust and non-robust value functions. Moreover, we provide a finite-sample guarantee on the learned robust policy and analyze the robust sub-optimality gap. Under mild low-dimensional structure on the transition model, the side information reduces this gap and improves sample efficiency. We assess the performance of our approach across OpenAI Gym environments and classic control problems, consistently demonstrating superior target-domain performance over state-of-the-art robust and non-robust baselines.
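To illustrate why a tighter uncertainty set gives a less pessimistic policy, here is a minimal sketch of tabular robust value iteration over an uncertainty set centered on an estimated kernel. It is not the paper's method. It assumes an L1 ball of a given radius around each estimated transition row, which is a common textbook choice, and it does not model the constrained estimation or the side information described above. The names `worst_case_expectation`, `robust_value_iteration`, `P_hat`, `R`, and `radius` are hypothetical, as is the toy two-state MDP.

```python
def worst_case_expectation(p_hat, v, radius):
    """Minimize p . v over distributions p with ||p - p_hat||_1 <= radius.

    Assumption: L1-ball uncertainty set centered on the estimate p_hat.
    The minimizer shifts up to radius/2 of probability mass onto the
    lowest-value state, taking it from the highest-value states first.
    """
    p = list(p_hat)
    i_min = min(range(len(v)), key=lambda i: v[i])
    budget = min(radius / 2.0, 1.0 - p[i_min])
    p[i_min] += budget
    for i in sorted(range(len(v)), key=lambda i: v[i], reverse=True):
        if budget <= 1e-12:
            break
        if i == i_min:
            continue
        take = min(p[i], budget)
        p[i] -= take
        budget -= take
    return sum(pi * vi for pi, vi in zip(p, v))


def robust_value_iteration(P_hat, R, gamma, radius, iters=500):
    """Iterate the robust Bellman operator on a tabular MDP.

    P_hat[s][a] is the estimated next-state distribution and R[s][a]
    the reward; both are illustrative inputs, not the paper's notation.
    """
    n_states = len(P_hat)
    V = [0.0] * n_states
    for _ in range(iters):
        V = [
            max(
                R[s][a] + gamma * worst_case_expectation(P_hat[s][a], V, radius)
                for a in range(len(P_hat[s]))
            )
            for s in range(n_states)
        ]
    return V


# Hypothetical two-state, two-action MDP used only for illustration.
P_hat = [
    [[0.9, 0.1], [0.5, 0.5]],
    [[0.2, 0.8], [0.6, 0.4]],
]
R = [[1.0, 0.0], [0.0, 2.0]]

V_nominal = robust_value_iteration(P_hat, R, gamma=0.9, radius=0.0)
V_tight = robust_value_iteration(P_hat, R, gamma=0.9, radius=0.1)
V_loose = robust_value_iteration(P_hat, R, gamma=0.9, radius=0.8)
```

With a radius of zero the routine reduces to ordinary value iteration on the estimated kernel. As the radius grows, the robust value at every state can only decrease, which is the conservatism that a tighter, estimate-centered set is meant to limit.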