Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning
This work addresses the problem of reliably evaluating decision policies without costly exploration, a problem faced by researchers and practitioners in reinforcement learning. It represents an incremental improvement over existing methods.
The paper tackles off-policy evaluation in reinforcement learning, where existing methods such as importance sampling and doubly robust estimates have limitations in efficiency, stability, and boundedness. The result is a set of new estimators based on empirical likelihood that are always more efficient than these existing methods and that inherit the stability and boundedness properties of self-normalized importance sampling; empirical studies suggest the new estimators provide advantages.
Off-policy evaluation (OPE) in both contextual bandits and reinforcement learning allows one to evaluate novel decision policies without needing to conduct exploration, which is often costly or otherwise infeasible. The problem's importance has attracted many proposed solutions, including importance sampling (IS), self-normalized IS (SNIS), and doubly robust (DR) estimates. DR and its variants ensure semiparametric local efficiency if the Q-functions are well-specified, but if the Q-functions are misspecified, DR can be worse than both IS and SNIS. DR also does not enjoy SNIS's inherent stability and boundedness. We propose new estimators for OPE based on empirical likelihood that are always more efficient than IS, SNIS, and DR and satisfy the same stability and boundedness properties as SNIS. On the way, we categorize various properties and classify existing estimators by them. Besides the theoretical guarantees, empirical studies suggest the new estimators provide advantages.
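To make the three baseline estimators named above concrete, the following is a minimal sketch for the contextual bandit case. It assumes that the logged data include the behavior policy's propensity for each action taken, and that a (possibly misspecified) Q-function estimate is available. All function and variable names are hypothetical and chosen for illustration; the sketch does not implement the empirical likelihood estimators proposed in the paper.

```python
from typing import List, Sequence


def importance_weights(pi_e: Sequence[float], pi_b: Sequence[float]) -> List[float]:
    """Density ratios pi_e(a_i | x_i) / pi_b(a_i | x_i) for the logged actions."""
    return [e / b for e, b in zip(pi_e, pi_b)]


def is_estimate(weights: Sequence[float], rewards: Sequence[float]) -> float:
    """Importance sampling (IS): average of weighted rewards."""
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def snis_estimate(weights: Sequence[float], rewards: Sequence[float]) -> float:
    """Self-normalized IS (SNIS): weighted rewards divided by the sum of weights."""
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


def dr_estimate(
    weights: Sequence[float],
    rewards: Sequence[float],
    q_taken: Sequence[float],
    v_target: Sequence[float],
) -> float:
    """Doubly robust (DR) estimate.

    q_taken[i]  : assumed model estimate q(x_i, a_i) for the logged action.
    v_target[i] : assumed model estimate sum_a pi_e(a | x_i) * q(x_i, a).
    """
    terms = [
        v + w * (r - q)
        for w, r, q, v in zip(weights, rewards, q_taken, v_target)
    ]
    return sum(terms) / len(terms)


if __name__ == "__main__":
    # Toy logged data (made up for illustration only).
    pi_e = [0.9, 0.2, 0.5, 0.1]   # target policy probabilities of logged actions
    pi_b = [0.5, 0.5, 0.5, 0.5]   # behavior policy propensities
    rewards = [1.0, 0.0, 1.0, 0.0]
    w = importance_weights(pi_e, pi_b)
    print("IS  :", is_estimate(w, rewards))
    print("SNIS:", snis_estimate(w, rewards))
    # With an all-zero Q-model, DR reduces to IS.
    zeros = [0.0] * len(rewards)
    print("DR  :", dr_estimate(w, rewards, zeros, zeros))
```

Because SNIS is a convex combination of the observed rewards when the weights are nonnegative, its value always lies within the range of those rewards, which is the boundedness property. Adding a constant to every reward shifts SNIS by exactly that constant, whereas IS shifts by the constant times the average weight, which illustrates the stability property.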