Deep reinforcement learning for weakly coupled MDP's with continuous actions
This work addresses reinforcement learning in resource-constrained environments and offers a novel method for continuous action spaces, but the contribution is incremental because it builds on existing weakly coupled MDP frameworks.
The paper tackles reinforcement learning in weakly coupled Markov Decision Processes (MDPs) with continuous actions under resource constraints. It introduces the LPCA algorithm, which decouples the MDP and is reported to allocate resources and maximize reward more robustly and efficiently than state-of-the-art methods.
This paper introduces the Lagrange Policy for Continuous Actions (LPCA), a reinforcement learning algorithm designed for weakly coupled MDP problems with continuous action spaces. LPCA addresses the challenge of resource constraints that depend on continuous actions by introducing a Lagrange relaxation of the weakly coupled MDP problem within a neural network framework for Q-value computation. This approach effectively decouples the MDP, enabling efficient policy learning in resource-constrained environments. We present two variations of LPCA: LPCA-DE, which uses differential evolution for global optimization, and LPCA-Greedy, which selects actions incrementally and greedily based on Q-value gradients. Comparative analysis against other state-of-the-art techniques across various settings highlights LPCA's robustness and efficiency in managing resource allocation while maximizing rewards.
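To make the idea of incremental, greedy, gradient-based selection concrete, here is a minimal sketch of how a shared resource budget could be split across decoupled sub-problems. It is not the paper's implementation and rests on several assumptions: the name `greedy_allocate` and the parameters `budget`, `step`, `a_max`, and `lam` are hypothetical, plain Python callables stand in for the learned per-subproblem Q-networks (evaluated at a fixed state), finite differences stand in for Q-value gradients, and `lam` plays the role of a fixed Lagrange multiplier, i.e. a price per unit of resource.

```python
import numpy as np

def greedy_allocate(q_funcs, budget, step=0.05, a_max=1.0, lam=0.0):
    """Split `budget` across sub-problems in increments of `step`.

    q_funcs : list of callables a -> Q-value; hypothetical stand-ins for
              learned per-subproblem Q-networks at a fixed state.
    lam     : assumed Lagrange multiplier (price per unit of resource).
    """
    n = len(q_funcs)
    actions = np.zeros(n)
    n_steps = int(round(budget / step))
    for _ in range(n_steps):
        # Penalized marginal gain of one more increment for each sub-problem.
        gains = np.full(n, -np.inf)
        for i, q in enumerate(q_funcs):
            if actions[i] + step <= a_max + 1e-12:
                slope = (q(actions[i] + step) - q(actions[i])) / step
                gains[i] = slope - lam
        best = int(np.argmax(gains))
        if gains[best] <= 0.0:
            break  # no remaining increment is worth its resource price
        actions[best] += step
    return actions

# Toy concave Q-functions (an assumption made purely for illustration).
toy_q = [lambda a, w=w: w * a - 0.5 * a ** 2 for w in (1.0, 0.5, 0.2)]
allocation = greedy_allocate(toy_q, budget=1.0)
print(allocation, allocation.sum())
```

Because each increment goes to the sub-problem with the steepest penalized Q-value slope, the total allocation never exceeds the budget. With concave toy Q-functions this greedy rule tracks the optimal split up to the step size; for non-concave learned Q-values it is only a heuristic, which is presumably where a global optimizer such as differential evolution becomes attractive.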