Average-Reward Learning and Planning with Options
This work addresses temporal abstraction in reinforcement learning for continuing tasks. Its contribution is incremental: it extends existing methods for the discounted setting to the average-reward setting.
The authors extend the options framework for temporal abstraction from discounted to average-reward Markov decision processes, develop convergent off-policy learning algorithms and sample-based planning variants, and demonstrate the algorithms' efficacy in experiments on a continuing Four-Room domain.
We extend the options framework for temporal abstraction in reinforcement learning from discounted Markov decision processes (MDPs) to average-reward MDPs. Our contributions include general convergent off-policy inter-option learning algorithms, intra-option algorithms for learning values and models, as well as sample-based planning variants of our learning algorithms. Our algorithms and convergence proofs extend those recently developed by Wan, Naik, and Sutton. We also extend the notion of option-interrupting behavior from the discounted to the average-reward formulation. We show the efficacy of the proposed algorithms with experiments on a continuing version of the Four-Room domain.
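To make the idea of inter-option learning in the average-reward setting concrete, the following is a minimal tabular sketch of one differential, SMDP-style update applied after an option terminates. It is an illustration under stated assumptions, not necessarily the exact algorithm from the paper: the function name, the argument names, the step-size parameters (alpha, eta, beta), the running option-length table, and the order of the updates are all hypothetical choices made for this example.

```python
import numpy as np

def inter_option_differential_update(q, lengths, avg_reward, s, o,
                                     cum_reward, duration, s_next,
                                     alpha=0.1, eta=1.0, beta=0.1):
    """One hypothetical inter-option update in the average-reward setting.

    q          : array [n_states, n_options] of differential option values
    lengths    : array [n_states, n_options] of estimated option durations
    avg_reward : current scalar estimate of the reward rate
    cum_reward : undiscounted reward accumulated while option o was running
    duration   : number of primitive steps option o ran before terminating
    Returns the updated reward-rate estimate; q and lengths are updated in place.
    """
    # Assumption: track a running estimate of how long this option lasts.
    lengths[s, o] += beta * (duration - lengths[s, o])

    # Differential TD error: the accumulated reward is centered by the
    # reward-rate estimate multiplied by the option's observed duration.
    delta = cum_reward - avg_reward * duration + q[s_next].max() - q[s, o]

    # Assumption: scale the step by the estimated duration so that long
    # options do not dominate the value and reward-rate updates.
    step = alpha * delta / lengths[s, o]
    q[s, o] += step
    avg_reward += eta * step
    return avg_reward
```

In this sketch the maximum over q[s_next] makes the update off-policy with respect to the option choice, and no discount factor appears anywhere: the reward-rate estimate plays the role that discounting plays in the discounted formulation.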