ReversedQ: Opportunities for Faster Q-Learning in Episodic Online Reinforcement Learning
For researchers in model-free RL, this work offers practical modifications to improve sample efficiency in episodic MDPs, though the improvements are demonstrated only in small-scale environments.
The paper identifies three opportunities for faster Q-learning in episodic online RL and proposes ReversedQ. Relative to RandomizedQ, ReversedQ raises scaled mean cumulative reward from 9.53% to 78.78% in the Bidirectional Diabolical Combination Lock (BDCL) and from 21.76% to 61.81% in a chain MDP.
We study model-free Q-learning in finite-horizon episodic Markov Decision Processes (MDPs) with stationary dynamics across episodes. We identify a central issue in nascent model-free posterior-sampling works: their reliance on delayed learning to prove theoretical guarantees. In particular, we identify three opportunities for faster learning: (i) value-function update order, (ii) update frequencies, and (iii) value-function initialization. Using Wang et al.'s RandomizedQ as a basis, we illustrate these changes and their individual (as well as cumulative) impact in multiple empirical studies. We find that our combined modifications, termed ReversedQ, improve scaled mean cumulative reward relative to RandomizedQ from 9.53% to 78.78% in the Bidirectional Diabolical Combination Lock (BDCL), and from 21.76% to 61.81% in a chain MDP.
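To make opportunity (i) concrete, the following is a minimal sketch of why update order can matter in a finite-horizon setting. It is not the ReversedQ or RandomizedQ algorithm: it uses plain tabular Q-learning with a fixed learning rate of 1, a zero initialization, and a hypothetical deterministic chain (`chain_episode`) whose only reward arrives on the final step. All function names, the environment, and the parameter choices are assumptions made for illustration.

```python
import numpy as np

def q_update_pass(Q, trajectory, lr=1.0, reverse=False):
    """Apply one tabular Q-learning pass over a stored episode.

    Q has shape (H + 1, S, A); Q[H] stays zero and acts as the terminal value.
    trajectory is a list of (h, s, a, r, s_next) tuples in time order.
    """
    steps = reversed(trajectory) if reverse else trajectory
    for h, s, a, r, s_next in steps:
        # Standard one-step target using the next stage's current estimate.
        target = r + Q[h + 1, s_next].max()
        Q[h, s, a] += lr * (target - Q[h, s, a])
    return Q

def chain_episode(H):
    """Hypothetical chain: action 0 moves right; reward 1 only on the last step."""
    return [(h, h, 0, 1.0 if h == H - 1 else 0.0, h + 1) for h in range(H)]

H, S, A = 5, 6, 2
traj = chain_episode(H)

# Same episode, same initialization; only the order of updates differs.
Q_forward = q_update_pass(np.zeros((H + 1, S, A)), traj, reverse=False)
Q_reverse = q_update_pass(np.zeros((H + 1, S, A)), traj, reverse=True)

print("forward order, Q[0, 0, 0]:", Q_forward[0, 0, 0])
print("reverse order, Q[0, 0, 0]:", Q_reverse[0, 0, 0])
```

In this toy setting, a forward pass moves the final reward back by only one stage per episode, whereas a reverse pass carries it to the first stage within the same episode. How much of the reported improvement comes from update order, as opposed to update frequency or initialization, is a question the sketch does not address.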