LG AI ML · Feb 11, 2025

Model Selection for Off-policy Evaluation: New Algorithms and Experimental Protocol

Pai Liu, Lingfeng Zhao, Shivangi Agarwal, Jinghan Liu, Audrey Huang, Philip Amortila, Nan Jiang

arXiv:2502.08021v3 · 15.75 citations · h-index: 8

Originality: Incremental advance

AI Analysis

The paper addresses a critical but under-investigated issue in offline RL that matters to both researchers and practitioners, though the advance is incremental because it builds on existing OPE methods.

The paper tackles the problem of hyperparameter tuning for off-policy evaluation in offline reinforcement learning by developing new model-free and model-based selectors with theoretical guarantees and a new experimental protocol. It finds that the new model-free selector, LSTD-Tournament, shows promising empirical performance on Gym-Hopper.

Holdout validation and hyperparameter tuning from data is a long-standing problem in offline reinforcement learning (RL). A standard framework is to use off-policy evaluation (OPE) methods to evaluate and select the policies, but OPE either incurs exponential variance (e.g., importance sampling) or has hyperparameters of its own (e.g., FQE and model-based). We focus on hyperparameter tuning for OPE itself, which is even more under-investigated. Concretely, we select among candidate value functions ("model-free") or dynamics models ("model-based") to best assess the performance of a target policy. We develop: (1) new model-free and model-based selectors with theoretical guarantees, and (2) a new experimental protocol for empirically evaluating them. Compared to the model-free protocol in prior works, our new protocol allows for more stable generation and better control of candidate value functions in an optimization-free manner, and evaluation of model-free and model-based methods alike. We exemplify the protocol on Gym-Hopper, and find that our new model-free selector, LSTD-Tournament, demonstrates promising empirical performance.
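The abstract's remark that importance sampling incurs exponential variance can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes a state-free toy setting in which the behavior policy draws actions independently at every step, and the function name `is_weight_variance` is hypothetical. Under those assumptions the trajectory weight is a product of independent per-step ratios with mean 1, so its variance is the per-step second moment raised to the horizon, minus 1.

```python
from fractions import Fraction


def is_weight_variance(target, behavior, horizon):
    """Exact variance of the trajectory importance weight prod_t pi(a_t)/mu(a_t).

    Toy assumption (not from the paper): actions are drawn i.i.d. from
    `behavior` at every step, and `behavior` puts positive probability on
    every action that `target` can take.
    """
    # Per-step second moment under the behavior policy: sum_a pi(a)^2 / mu(a).
    # Actions with pi(a) = 0 contribute nothing and are skipped.
    second_moment = sum(
        Fraction(p) ** 2 / Fraction(m) for p, m in zip(target, behavior) if p
    )
    # Each ratio has mean 1 and the steps are independent, so
    # E[W^2] = second_moment ** horizon and Var[W] = E[W^2] - 1.
    return second_moment ** horizon - 1


# Deterministic target policy versus a uniform behavior policy over two actions.
half = Fraction(1, 2)
for horizon in (1, 10, 20):
    print(horizon, is_weight_variance([1, 0], [half, half], horizon))
```

In this toy case the per-step second moment is 2, so the variance doubles (plus one) with every extra step. That growth is the motivation the abstract gives for turning to estimators such as FQE or model-based methods, which in turn introduce hyperparameters that need selecting.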
