Towards Hyperparameter-free Policy Selection for Offline Reinforcement Learning
This work addresses a critical bottleneck for offline RL researchers and practitioners by removing the need to tune hyperparameters during policy selection, though the contribution is incremental in that it builds on an existing theoretical advance (BVFT).
The paper tackles hyperparameter-free policy selection in offline reinforcement learning. It designs BVFT-based algorithms for discrete-action benchmarks and combines BVFT with off-policy evaluation for continuous-action domains, and it reports effective performance without introducing additional hyperparameters.
How to select between policies and value functions produced by different training algorithms in offline reinforcement learning (RL), which is crucial for hyperparameter tuning, is an important open question. Existing approaches based on off-policy evaluation (OPE) often require additional function approximation and hence hyperparameters, creating a chicken-and-egg situation. In this paper, we design hyperparameter-free algorithms for policy selection based on BVFT [XJ21], a recent theoretical advance in value-function selection, and demonstrate their effectiveness in discrete-action benchmarks such as Atari. To address performance degradation due to poor critics in continuous-action domains, we further combine BVFT with OPE to get the best of both worlds, and obtain a hyperparameter-tuning method for Q-function based OPE with theoretical guarantees as a side product.
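To make the value-function selection idea concrete, here is a minimal sketch of a pairwise BVFT-style loss as we understand it from [XJ21]. It is an illustration under stated assumptions, not the algorithm proposed in this paper. The function names (`bvft_pair_loss`, `bvft_select`), the array layout, and in particular the fixed `resolution` argument are hypothetical; the paper's aim is precisely to avoid leaving such a knob to the user. Each candidate Q-function is assumed to have been evaluated on the offline dataset in advance.

```python
import numpy as np


def bvft_pair_loss(q_i, q_j, target_i, resolution):
    """Projected Bellman error of candidate i on the partition induced by (i, j).

    q_i, q_j  : Q-values of two candidates at the dataset's (s, a) pairs, shape (N,)
    target_i  : empirical Bellman backup of candidate i, shape (N,)
    resolution: discretization width (assumed fixed here, for illustration only)
    """
    # Group samples whose (q_i, q_j) values fall into the same grid cell.
    bins_i = np.floor(q_i / resolution).astype(np.int64)
    bins_j = np.floor(q_j / resolution).astype(np.int64)
    cells = np.stack([bins_i, bins_j], axis=1)
    _, cell_ids = np.unique(cells, axis=0, return_inverse=True)
    cell_ids = cell_ids.ravel()

    # Least-squares projection onto piecewise-constant functions:
    # the best constant in each cell is the mean backup target in that cell.
    counts = np.bincount(cell_ids)
    sums = np.bincount(cell_ids, weights=target_i)
    projected = (sums / counts)[cell_ids]

    # Root-mean-square distance between q_i and its projected backup.
    return float(np.sqrt(np.mean((q_i - projected) ** 2)))


def bvft_select(q_sa, v_next, rewards, dones, gamma, resolution):
    """Return the index of the candidate with the smallest worst-case pairwise loss.

    q_sa   : shape (K, N), Q_k(s, a) on the dataset
    v_next : shape (K, N), max_a' Q_k(s', a') on the dataset
    rewards, dones : shape (N,); dones is 1.0 for terminal transitions
    """
    num_candidates = q_sa.shape[0]
    targets = rewards + gamma * (1.0 - dones) * v_next  # shape (K, N)
    losses = np.zeros(num_candidates)
    for i in range(num_candidates):
        losses[i] = max(
            bvft_pair_loss(q_sa[i], q_sa[j], targets[i], resolution)
            for j in range(num_candidates)
        )
    return int(np.argmin(losses)), losses
```

The sketch ranks candidates by their worst-case loss over all pairings, so a candidate is preferred only if its Bellman error stays small on every partition it is tested against. It covers only the discrete-action, value-function-selection part; the combination with OPE for continuous-action domains is not shown.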