Q-Measure-Learning for Continuous State RL: Efficient Implementation and Convergence
This work addresses efficient reinforcement learning in continuous state spaces and is aimed at reinforcement learning researchers and practitioners.
The authors propose Q-Measure-Learning, whose weight-based implementation has $O(n)$ memory and $O(n)$ computation cost per iteration. Under uniform ergodicity of the behavior chain, they prove almost sure sup-norm convergence of the induced Q-function to the fixed point of a kernel-smoothed Bellman operator, and they bound the approximation error between this limit and the optimal $Q^*$ as a function of the kernel bandwidth.
We study reinforcement learning in infinite-horizon discounted Markov decision processes with continuous state spaces, where data are generated online from a single trajectory under a Markovian behavior policy. To avoid maintaining an infinite-dimensional, function-valued estimate, we propose Q-Measure-Learning, which learns a signed empirical measure supported on visited state-action pairs and reconstructs an action-value estimate via kernel integration. The method jointly estimates the stationary distribution of the behavior chain and the Q-measure through coupled stochastic approximation, leading to an efficient weight-based implementation with $O(n)$ memory and $O(n)$ computation cost per iteration. Under uniform ergodicity of the behavior chain, we prove almost sure sup-norm convergence of the induced Q-function to the fixed point of a kernel-smoothed Bellman operator. We also bound the approximation error between this limit and the optimal $Q^*$ as a function of the kernel bandwidth. To assess the performance of the proposed algorithm, we conduct RL experiments in a two-item inventory control setting.
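To make the reconstruction step concrete, the following is a minimal sketch of how an action-value estimate could be read off a signed empirical measure by kernel integration. It is not the paper's algorithm: the Gaussian kernel, the finite action set, the absence of any normalization by the stationary distribution, and the names `gaussian_kernel`, `q_from_measure`, `states`, `actions`, and `weights` are all assumptions made for illustration, and the stochastic approximation updates of the weights are omitted entirely.

```python
import numpy as np


def gaussian_kernel(s, centers, bandwidth):
    """Gaussian kernel values between a query state and each stored state.

    The Gaussian form is an assumption; the abstract only says "kernel".
    """
    sq_dist = np.sum((centers - s) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def q_from_measure(s, a, states, actions, weights, bandwidth):
    """Evaluate Q(s, a) by integrating the kernel against a signed measure.

    The measure is a weighted sum of atoms at visited (state, action) pairs.
    Only atoms whose stored action equals `a` contribute (finite action set
    assumed). One evaluation touches each stored atom once, so it costs O(n).
    """
    same_action = (actions == a)
    k = gaussian_kernel(s, states, bandwidth)
    return float(np.sum(weights * k * same_action))


# Hypothetical visited pairs and signed weights, chosen only for illustration.
states = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
actions = np.array([0, 1, 0])
weights = np.array([0.5, -0.2, 1.0])

q = q_from_measure(np.array([0.0, 0.0]), 0, states, actions, weights, bandwidth=1.0)
```

In this sketch the stored arrays grow by one entry per visited pair, which matches the $O(n)$ memory figure, and a smaller `bandwidth` makes the estimate more local at the cost of relying on fewer nearby atoms.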