Average-reward reinforcement learning in semi-Markov decision processes via relative value iteration
This work addresses reinforcement learning for weakly communicating SMDPs, representing an incremental advancement with new monotonicity conditions to expand the algorithmic framework.
The paper tackles the problem of reinforcement learning in average-reward semi-Markov decision processes by establishing the convergence of an asynchronous relative value iteration Q-learning algorithm, showing it converges almost surely to solutions of the optimality equation under certain conditions.
This paper applies the authors' recent results on asynchronous stochastic approximation (SA) in the Borkar-Meyn framework to reinforcement learning in average-reward semi-Markov decision processes (SMDPs). We establish the convergence of an asynchronous SA analogue of Schweitzer's classical relative value iteration algorithm, RVI Q-learning, for finite-space, weakly communicating SMDPs. In particular, we show that the algorithm converges almost surely to a compact, connected subset of solutions to the average-reward optimality equation, with convergence to a unique, sample path-dependent solution under additional stepsize and asynchrony conditions. Moreover, to make full use of the SA framework, we introduce new monotonicity conditions for estimating the optimal reward rate in RVI Q-learning. These conditions substantially expand the previously considered algorithmic framework and are addressed through novel arguments in the stability and convergence analysis of RVI Q-learning.