LG SP · Feb 8, 2024

Multi-Timescale Ensemble Q-learning for Markov Decision Process Policy Optimization

arXiv:2402.05476v110.48 citations · h-index: 5 · Has Code · IEEE Transactions on Signal Processing

Originality: Incremental advance

AI Analysis

This work addresses efficiency issues in reinforcement learning for network control, offering incremental improvements in policy optimization for large-scale Markov decision processes.

The paper tackles the performance and complexity challenges of Q-learning in large networks by proposing a model-free ensemble RL algorithm that runs multiple Q-learning instances on synthetic Markovian environments and fuses their outputs with an adaptive weighting mechanism. It reports up to 55% lower average policy error and up to 50% lower runtime complexity than state-of-the-art Q-learning algorithms.

Reinforcement learning (RL) is a classical tool for solving network control or policy optimization problems in unknown environments. The original Q-learning suffers from performance and complexity challenges across very large networks. Herein, a novel model-free ensemble reinforcement learning algorithm that adapts classical Q-learning is proposed to handle these challenges for networks that admit Markov decision process (MDP) models. Multiple Q-learning algorithms are run in parallel on multiple, distinct, synthetically created, and structurally related Markovian environments; the outputs are fused using an adaptive weighting mechanism based on the Jensen-Shannon divergence (JSD) to obtain an approximately optimal policy with low complexity. The theoretical justification of the algorithm, including the convergence of key statistics and Q-functions, is provided. Numerical results across several network models show that the proposed algorithm can achieve up to 55% less average policy error with up to 50% less runtime complexity than the state-of-the-art Q-learning algorithms. The numerical results also validate the assumptions made in the theoretical analysis.
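The ensemble idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several choices here are assumptions made only for illustration: the synthetic environments are built as n-step powers of the base transition matrices, each Q-table is turned into a distribution with a softmax, and the fusion weights decay exponentially with the JSD to the base environment's Q-table. The names `random_mdp`, `synthetic_envs`, `q_learning`, and `fuse` are hypothetical.

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def random_mdp(n_states=5, n_actions=2, seed=0):
    """Hypothetical toy MDP: transitions P[a, s, s'] and rewards R[s, a]."""
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
    R = rng.random((n_states, n_actions))
    return P, R

def synthetic_envs(P, orders=(1, 2, 3)):
    """Assumption: the n-th synthetic environment uses n-step transitions P[a]^n."""
    return [np.stack([np.linalg.matrix_power(Pa, n) for Pa in P]) for n in orders]

def q_learning(P, R, gamma=0.9, alpha=0.1, explore=0.2, steps=5000, seed=0):
    """Plain tabular Q-learning with epsilon-greedy exploration."""
    rng = np.random.default_rng(seed)
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(steps):
        if rng.random() < explore:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next = int(rng.choice(n_states, p=P[a, s]))
        Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q

def fuse(q_tables, temperature=1.0):
    """Assumed fusion rule: weights shrink with JSD to the first (base) Q-table."""
    ref = softmax(q_tables[0].ravel())
    d = np.array([jsd(ref, softmax(Q.ravel())) for Q in q_tables])
    w = softmax(-d / temperature)
    fused = sum(wi * Q for wi, Q in zip(w, q_tables))
    return fused, w

if __name__ == "__main__":
    P, R = random_mdp()
    q_tables = [q_learning(Pn, R, seed=i) for i, Pn in enumerate(synthetic_envs(P))]
    fused_Q, weights = fuse(q_tables)
    print("fusion weights:", weights)
    print("greedy policy:", fused_Q.argmax(axis=1))
```

In this sketch the base environment always receives the largest weight, because its JSD to itself is zero. The paper's actual weighting mechanism and environment construction may differ.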

