Percentile Criterion Optimization in Offline Reinforcement Learning
This work addresses the problem of robust policy learning in high-stakes decision-making with limited data for reinforcement learning practitioners, offering a novel method to reduce conservatism.
The paper tackles the challenge of optimizing the percentile criterion in offline reinforcement learning, which often leads to overly conservative policies due to large ambiguity sets. The proposed Value-at-Risk based dynamic programming algorithm learns less conservative robust policies by implicitly constructing smaller ambiguity sets, as shown in theoretical and empirical results.
In reinforcement learning, robust policies for high-stakes decision-making problems with limited data are usually computed by optimizing the \emph{percentile criterion}. The percentile criterion is approximately solved by constructing an \emph{ambiguity set} that contains the true model with high probability and optimizing the policy for the worst model in the set. Since the percentile criterion is non-convex, constructing ambiguity sets is often challenging. Existing work uses \emph{Bayesian credible regions} as ambiguity sets, but they are often unnecessarily large and result in learning overly conservative policies. To overcome these shortcomings, we propose a novel Value-at-Risk based dynamic programming algorithm to optimize the percentile criterion without explicitly constructing any ambiguity sets. Our theoretical and empirical results show that our algorithm implicitly constructs much smaller ambiguity sets and learns less conservative robust policies.