LG, Jul 8, 2024

Periodic agent-state based Q-learning for POMDPs

arXiv:2407.06121v3, 6 citations, h-index: 27
AI Analysis

This work addresses a specific bottleneck in reinforcement learning for POMDPs by introducing periodic policies, offering a novel but incremental improvement over existing agent-state methods.

The paper tackles the problem of reinforcement learning in partially observable environments by proposing PASQL, a method that learns periodic policies instead of stationary ones, demonstrating through numerical experiments that periodic policies can outperform stationary policies in agent-state based Q-learning.

The standard approach for Partially Observable Markov Decision Processes (POMDPs) is to convert them to a fully observed belief-state MDP. However, the belief state depends on the system model and is therefore not viable in reinforcement learning (RL) settings. A widely used alternative is to use an agent state, which is a model-free, recursively updatable function of the observation history. Examples include frame stacking and recurrent neural networks. Since the agent state is model-free, it is used to adapt standard RL algorithms to POMDPs. However, standard RL algorithms like Q-learning learn a stationary policy. Our main thesis, which we illustrate via examples, is that because the agent state does not satisfy the Markov property, non-stationary agent-state based policies can outperform stationary ones. To leverage this feature, we propose PASQL (periodic agent-state based Q-learning), which is a variant of agent-state-based Q-learning that learns periodic policies. By combining ideas from periodic Markov chains and stochastic approximation, we rigorously establish that PASQL converges to a cyclic limit and characterize the approximation error of the converged periodic policy. Finally, we present a numerical experiment to highlight the salient features of PASQL and demonstrate the benefit of learning periodic policies over stationary policies.
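To make the idea of a periodic agent-state policy concrete, here is a minimal tabular sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: it assumes one Q-table per phase of the period, with the table for phase t mod L bootstrapping from the table for the next phase. The function names (`pasql_update`, `greedy_periodic_policy`, `run_toy`) and the toy environment are hypothetical. In the toy problem the hidden state alternates between 0 and 1, the agent always sees the same observation (so the agent state is constant), and the reward is 1 when the action matches the hidden state. No stationary policy on that single agent state can be right at every step, while a period-2 policy can.

```python
import random


def pasql_update(q, period, t, z, a, r, z_next, alpha=0.1, gamma=0.9):
    """One periodic Q-learning step (assumed form, not taken from the paper).

    q is a list of `period` tables, indexed as q[phase][agent_state][action].
    The table for phase t % period bootstraps from the table of the next phase.
    """
    phase = t % period
    next_phase = (t + 1) % period
    target = r + gamma * max(q[next_phase][z_next])
    q[phase][z][a] += alpha * (target - q[phase][z][a])


def greedy_periodic_policy(q):
    """Greedy action for every (phase, agent_state) pair."""
    return [
        [max(range(len(row)), key=row.__getitem__) for row in table]
        for table in q
    ]


def run_toy(steps=20000, period=2, seed=0):
    """Hypothetical toy POMDP: hidden state s_t = t % 2, constant observation.

    Reward is 1 if the action equals the hidden state, else 0. Actions are
    chosen uniformly at random (off-policy exploration).
    """
    rng = random.Random(seed)
    n_agent_states, n_actions = 1, 2
    q = [
        [[0.0] * n_actions for _ in range(n_agent_states)]
        for _ in range(period)
    ]
    z = 0  # the only agent state: the observation never changes
    for t in range(steps):
        hidden = t % 2
        a = rng.randrange(n_actions)
        r = 1.0 if a == hidden else 0.0
        pasql_update(q, period, t, z, a, r, z_next=0)
    return q, greedy_periodic_policy(q)


if __name__ == "__main__":
    q, policy = run_toy()
    print("greedy action per phase:", policy)
```

With `period=2` the greedy policy picks a different action in each phase, which a single stationary table (`period=1`) cannot represent for this agent state.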

