Quantile Filtered Imitation Learning
This work gives offline reinforcement learning researchers and practitioners a method for trading off the bias and variance of a policy improvement step. The contribution is incremental, since it builds on existing imitation learning and filtering techniques.
The authors tackle offline reinforcement learning by introducing quantile filtered imitation learning (QFIL), a policy improvement operator that filters out state-action pairs whose estimated Q values fall below a quantile of the pushforward Q distribution. The result is a safe policy improvement step, and the method performs well on the D4RL benchmark.
We introduce quantile filtered imitation learning (QFIL), a novel policy improvement operator designed for offline reinforcement learning. QFIL performs policy improvement by running imitation learning on a filtered version of the offline dataset. The filtering process removes $(s, a)$ pairs whose estimated Q values fall below a given quantile of the pushforward distribution over values induced by sampling actions from the behavior policy. The definitions of both the pushforward Q distribution and the resulting value function quantile are key contributions of our method. We prove that QFIL yields a safe policy improvement step with function approximation and that the choice of quantile provides a natural hyperparameter to trade off the bias and variance of the improvement step. Empirically, we perform a synthetic experiment illustrating how QFIL makes this bias-variance tradeoff, and we show that QFIL performs well on the D4RL benchmark.
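The filtering step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function `qfil_filter`, the callables `q_fn` and `sample_behavior_actions`, and the parameters `tau` and `n_samples` are hypothetical names chosen for illustration. The sketch assumes an estimated Q function and a sampler for the behavior policy are already available, and it estimates the quantile of the pushforward Q distribution by Monte Carlo sampling.

```python
import numpy as np


def qfil_filter(states, actions, q_fn, sample_behavior_actions,
                tau=0.9, n_samples=32, rng=None):
    """Return a boolean mask over the dataset marking the (s, a) pairs to keep.

    Assumed (hypothetical) interfaces:
      q_fn(s, a) -> float                      estimated Q value
      sample_behavior_actions(s, n, rng) -> n actions sampled from the
                                               (estimated) behavior policy at s
    """
    rng = np.random.default_rng() if rng is None else rng
    keep = np.zeros(len(states), dtype=bool)
    for i, (s, a) in enumerate(zip(states, actions)):
        # Push behavior-policy actions through Q to get samples from the
        # pushforward distribution over values at state s.
        sampled_actions = sample_behavior_actions(s, n_samples, rng)
        q_samples = np.array([q_fn(s, a_prime) for a_prime in sampled_actions])
        # Empirical tau-quantile of the sampled values at this state.
        threshold = np.quantile(q_samples, tau)
        # Keep the dataset pair only if its value is not below the quantile.
        keep[i] = q_fn(s, a) >= threshold
    return keep
```

Imitation learning (for example, behavior cloning) would then be run only on `states[keep]` and `actions[keep]`. In this sketch a larger `tau` keeps fewer, higher-value pairs, while `tau = 0` keeps every pair whose value is at least the smallest sampled value, which approaches plain imitation of the behavior policy.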