LG · Aug 29, 2025

Convergence of regularized agent-state-based Q-learning in POMDPs

arXiv:2508.21314v2 · h-index: 42 · CDC
Originality: Incremental advance
AI Analysis

This work provides theoretical guarantees for Q-learning methods used in practice on POMDPs. The advance is incremental, since it builds on existing Q-learning frameworks.

The paper studies the convergence of Q-learning in partially observable Markov decision processes (POMDPs) when the learner uses agent states and policy regularization. It shows that the algorithm converges, under mild conditions, to the fixed point of a regularized MDP, and numerical examples agree with the theoretical results.

In this paper, we present a framework to understand the convergence of Q-learning reinforcement learning algorithms commonly used in practice. Two salient features of such algorithms are: (i) the Q-table is recursively updated using an agent state (such as the state of a recurrent neural network) which is not a belief state or an information state, and (ii) policy regularization is often used to encourage exploration and stabilize the learning algorithm. We investigate the simplest form of such Q-learning algorithms, which we call regularized agent-state-based Q-learning (RASQL), and show that it converges under mild technical conditions to the fixed point of an appropriately defined regularized MDP, which depends on the stationary distribution induced by the behavioral policy. We also show that a similar analysis continues to work for a variant of RASQL that learns periodic policies. We present numerical examples to illustrate that the empirical convergence behavior matches the proposed theoretical limit.
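
The two features named in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm or experimental setup, and it makes several assumptions that do not come from the abstract: the agent state is simply the most recent observation, the regularizer is entropy (so the next-state value is a temperature-scaled log-sum-exp), the behavioral policy is uniform random, and the POMDP is a small randomly generated one. The names `soft_value`, `rasql_sketch`, `tau`, and the step-size schedule are all hypothetical.

```python
import numpy as np


def soft_value(q_row, tau):
    """Entropy-regularized value: tau * log(sum_a exp(Q(z, a) / tau)).

    The maximum is subtracted first for numerical stability.
    """
    m = np.max(q_row)
    return m + tau * np.log(np.sum(np.exp((q_row - m) / tau)))


def rasql_sketch(n_steps=20000, gamma=0.9, tau=0.5, seed=0):
    """Tabular Q-learning on agent states with an entropy-regularized target.

    Assumptions (not from the paper): agent state = last observation,
    uniform behavioral policy, random toy POMDP, polynomial step sizes.
    """
    rng = np.random.default_rng(seed)
    n_states, n_obs, n_actions = 4, 2, 2

    # Hypothetical toy POMDP: P[s, a, s'], O[s, y], R[s, a] with rewards in [0, 1].
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    O = rng.dirichlet(np.ones(n_obs), size=n_states)
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

    # The Q-table is indexed by agent state z, not by the hidden state s.
    Q = np.zeros((n_obs, n_actions))
    visits = np.zeros((n_obs, n_actions))

    s = rng.integers(n_states)
    z = rng.choice(n_obs, p=O[s])
    for _ in range(n_steps):
        a = rng.integers(n_actions)  # fixed uniform behavioral policy
        r = R[s, a]
        s_next = rng.choice(n_states, p=P[s, a])
        z_next = rng.choice(n_obs, p=O[s_next])

        # Per-pair decaying step size (an illustrative choice).
        visits[z, a] += 1
        alpha = 1.0 / visits[z, a] ** 0.7

        # Regularized target: soft value of the next agent state.
        target = r + gamma * soft_value(Q[z_next], tau)
        Q[z, a] += alpha * (target - Q[z, a])

        s, z = s_next, z_next
    return Q


if __name__ == "__main__":
    print(rasql_sketch())
```

Because the hidden state `s` drives the dynamics while the update only sees `z`, the values the table settles toward depend on how often each hidden state occurs under the behavioral policy, which is consistent with the abstract's remark that the limit depends on the induced stationary distribution.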
