LG | Jul 8, 2024

Periodic agent-state based Q-learning for POMDPs

Amit Sinha, Matthieu Geist, Aditya Mahajan

arXiv:2407.06121v3 | 12.56 citations | h-index: 27

Originality: Incremental advance

AI Analysis

This work addresses a specific bottleneck in reinforcement learning for POMDPs by introducing periodic policies, offering a novel but incremental improvement over existing agent-state methods.

The paper tackles the problem of reinforcement learning in partially observable environments by proposing PASQL, a method that learns periodic policies instead of stationary ones, demonstrating through numerical experiments that periodic policies can outperform stationary policies in agent-state based Q-learning.

The standard approach for Partially Observable Markov Decision Processes (POMDPs) is to convert them to a fully observed belief-state MDP. However, the belief state depends on the system model and is therefore not viable in reinforcement learning (RL) settings. A widely used alternative is to use an agent state, which is a model-free, recursively updatable function of the observation history. Examples include frame stacking and recurrent neural networks. Since the agent state is model-free, it is used to adapt standard RL algorithms to POMDPs. However, standard RL algorithms like Q-learning learn a stationary policy. Our main thesis, which we illustrate via examples, is that because the agent state does not satisfy the Markov property, non-stationary agent-state based policies can outperform stationary ones. To leverage this feature, we propose PASQL (periodic agent-state based Q-learning), a variant of agent-state-based Q-learning that learns periodic policies. By combining ideas from periodic Markov chains and stochastic approximation, we rigorously establish that PASQL converges to a cyclic limit and characterize the approximation error of the converged periodic policy. Finally, we present a numerical experiment to highlight the salient features of PASQL and demonstrate the benefit of learning periodic policies over stationary policies.
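To make the idea of a periodic agent-state policy concrete, here is a minimal tabular sketch in Python. It is an illustration under stated assumptions, not the authors' implementation. It assumes the periodic variant keeps one Q-table per phase of a period L, selects the table by `t mod L`, and bootstraps from the table of the next phase. The function names (`periodic_q_update`, `greedy_periodic_policy`, `stack_frames`) and the default step size and discount are hypothetical choices made for this example.

```python
from collections import defaultdict


def stack_frames(z, obs, k=2):
    """Frame stacking as an agent state: keep the last k observations.

    This is a model-free, recursive update z_next = f(z, obs).
    """
    return (z + (obs,))[-k:]


def periodic_q_update(q_tables, phase, z, a, r, z_next, n_actions,
                      alpha=0.1, gamma=0.9):
    """One tabular Q-learning step for the table indexed by `phase`.

    Assumption for this sketch: the bootstrap target uses the table of the
    NEXT phase, (phase + 1) mod L, where L = len(q_tables).
    """
    next_phase = (phase + 1) % len(q_tables)
    best_next = max(q_tables[next_phase][(z_next, b)] for b in range(n_actions))
    td_error = r + gamma * best_next - q_tables[phase][(z, a)]
    q_tables[phase][(z, a)] += alpha * td_error
    return q_tables[phase][(z, a)]


def greedy_periodic_policy(q_tables, n_actions):
    """Return act(t, z): greedy with respect to the table for phase t mod L."""
    def act(t, z):
        q = q_tables[t % len(q_tables)]
        return max(range(n_actions), key=lambda b: q[(z, b)])
    return act


# Usage: period L = 2, two actions, agent state = last two observations.
L, n_actions = 2, 2
q_tables = [defaultdict(float) for _ in range(L)]
z = stack_frames((), "o1")
z_next = stack_frames(z, "o2")
periodic_q_update(q_tables, phase=0, z=z, a=0, r=1.0, z_next=z_next,
                  n_actions=n_actions)
policy = greedy_periodic_policy(q_tables, n_actions)
```

With L = 1 the sketch reduces to ordinary stationary agent-state Q-learning, which is the baseline the abstract compares against.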
