Hidden Markov Model Estimation-Based Q-learning for Partially Observable Markov Decision Process
This work addresses the challenge of applying Q-learning in partially observable environments and is aimed at reinforcement learning practitioners, though it appears incremental because it builds on existing HMM and Q-learning methods.
The paper tackles the poor performance of Q-learning in partially observable Markov decision processes (POMDPs) by proposing an online Hidden Markov Model (HMM) estimation-based Q-learning algorithm. It shows that the POMDP estimation converges to stationary points and that the Q function converges to a fixed point satisfying the Bellman optimality equation.
The objective is to study an online Hidden Markov Model (HMM) estimation-based Q-learning algorithm for partially observable Markov decision processes (POMDPs) with finite state and action sets. When the full state observation is available, Q-learning finds the optimal action-value function (Q function). However, Q-learning can perform poorly when the full state observation is not available. In this paper, we formulate POMDP estimation as an HMM estimation problem and propose a recursive algorithm to estimate both the POMDP parameters and the Q function concurrently. We also show that the POMDP estimation converges to a set of stationary points for the maximum likelihood estimate, and that the Q function estimation converges to a fixed point that satisfies the Bellman optimality equation weighted by the invariant distribution of the state belief determined by the HMM estimation process.
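To make the idea of combining a state belief with Q-learning concrete, the following is a minimal illustrative sketch, not the paper's exact recursion. It assumes the transition and observation matrices are already known, whereas the paper estimates them online; it pairs a standard HMM forward filter with a belief-weighted tabular Q update. All names (`belief_update`, `q_update`, `run_demo`) and the toy matrices `T`, `O`, and `R` are hypothetical and chosen only for illustration.

```python
import numpy as np


def belief_update(belief, action, obs, T, O):
    """One step of the HMM forward filter.

    T[a, s, s2] = P(s2 | s, a) and O[s2, o] = P(o | s2) are assumed known here.
    """
    predicted = belief @ T[action]      # predict the next-state distribution
    unnorm = predicted * O[:, obs]      # weight by the observation likelihood
    return unnorm / unnorm.sum()        # normalize back to a distribution


def q_update(Q, belief, action, reward, next_belief, alpha=0.1, gamma=0.95):
    """Belief-weighted tabular Q-learning step on Q[s, a] (illustrative)."""
    target = reward + gamma * np.max(next_belief @ Q)
    td_error = target - belief @ Q[:, action]
    # Each state's entry moves in proportion to how likely that state is.
    Q[:, action] += alpha * belief * td_error
    return Q


def run_demo(steps=200, seed=0):
    """Run the sketch on a toy 2-state, 2-action, 2-observation POMDP."""
    rng = np.random.default_rng(seed)
    T = np.array([[[0.9, 0.1], [0.2, 0.8]],      # action 0
                  [[0.5, 0.5], [0.5, 0.5]]])     # action 1
    O = np.array([[0.8, 0.2], [0.3, 0.7]])       # noisy observations
    R = np.array([[1.0, 0.0], [0.0, 1.0]])       # hypothetical reward R[s, a]
    Q = np.zeros((2, 2))
    belief = np.array([0.5, 0.5])
    state = 0
    for _ in range(steps):
        action = int(rng.integers(2))            # uniform random exploration
        reward = R[state, action]
        state = int(rng.choice(2, p=T[action, state]))
        obs = int(rng.choice(2, p=O[state]))
        next_belief = belief_update(belief, action, obs, T, O)
        Q = q_update(Q, belief, action, reward, next_belief)
        belief = next_belief
    return Q, belief


if __name__ == "__main__":
    Q, belief = run_demo()
    print("Q estimate:\n", Q)
    print("final belief:", belief)
```

In this sketch the agent never sees `state` directly; it only uses the observation to refresh `belief`, and the Q table is read and updated through that belief. Replacing the fixed `T` and `O` with online estimates would be the step that the paper's HMM estimation addresses.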