LG, OC | May 23, 2024

Efficiently Training Deep-Learning Parametric Policies using Lagrangian Duality

Andrew Rosemberg, Alexandre Street, Davi M. Valladão, Pascal Van Hentenryck

arXiv:2405.14973v24.61 citations | h-index: 23

Originality Highly original

AI Analysis

The paper addresses sample inefficiency and computational cost in strictly constrained CMDPs for domains such as power systems and robotics, and offers a novel algorithmic approach.

The paper tackles the challenge of training deep-learning policies for constrained Markov decision processes (CMDPs) in high-stakes applications such as power systems, introducing a Two-Stage Deep Decision Rules (TS-DDR) method that uses Lagrangian duality. The results show that TS-DDR improves solution quality and reduces computation times by several orders of magnitude compared to state-of-the-art methods on a real-world power system problem.

Constrained Markov Decision Processes (CMDPs) are critical in many high-stakes applications, where decisions must optimize cumulative rewards while strictly adhering to complex nonlinear constraints. In domains such as power systems, finance, supply chains, and precision robotics, violating these constraints can result in significant financial or societal costs. Existing Reinforcement Learning (RL) methods often struggle with sample efficiency and with finding feasible policies for highly and strictly constrained CMDPs, which limits their applicability in these environments. Stochastic dual dynamic programming is often used in practice on convex relaxations of the original problem, but it also encounters computational challenges and loss of optimality. This paper introduces a novel approach, Two-Stage Deep Decision Rules (TS-DDR), to efficiently train parametric actor policies using Lagrangian duality. TS-DDR is a self-supervised learning algorithm that trains general decision rules (parametric policies) using stochastic gradient descent (SGD): its forward passes solve deterministic optimization problems to find feasible policies, and its backward passes leverage duality theory to train the parametric policy with closed-form gradients. TS-DDR inherits the flexibility and computational performance of deep learning methodologies to solve CMDP problems. Applied to the Long-Term Hydrothermal Dispatch (LTHD) problem using actual power system data from Bolivia, TS-DDR is shown to enhance solution quality and to reduce computation times by several orders of magnitude when compared to current state-of-the-art methods.
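The forward/backward pattern described in the abstract can be illustrated with a minimal single-stage sketch. This is not the paper's implementation: the quadratic inner problem, the linear decision rule, and the names `solve_inner`, `ideal_level`, `train`, `c`, and `rho` are all assumptions chosen so that the inner problem has a closed-form solution and no solver is needed. The forward pass solves a deterministic problem in which a policy-proposed target is imposed as a constraint with a penalized slack; the backward pass uses the multiplier of that constraint as the gradient of the optimal value with respect to the target, then applies the chain rule to the policy parameters.

```python
import random


def solve_inner(target, ideal, c=1.0, rho=10.0):
    """Forward pass: solve a toy deterministic problem in closed form.

        min_{x, s}  0.5*c*(x - ideal)**2 + 0.5*rho*s**2
        s.t.        x - s = target

    Returns the optimal value and the multiplier `lam` of the target
    constraint, using the Lagrangian term lam*(target - x + s).
    By the envelope theorem, d(value)/d(target) = lam.
    """
    lam = (target - ideal) * c * rho / (c + rho)
    x = ideal + lam / c          # stationarity in x: c*(x - ideal) = lam
    s = -lam / rho               # stationarity in s: rho*s = -lam
    value = 0.5 * c * (x - ideal) ** 2 + 0.5 * rho * s ** 2
    return value, lam


def ideal_level(w):
    # Hypothetical cost-minimizing storage level for an inflow w (assumption).
    return 2.0 + 0.5 * w


def train(steps=5000, lr=0.1, seed=0):
    """SGD on a linear decision rule: target = theta[0] + theta[1] * w."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(steps):
        w = rng.uniform(0.0, 2.0)                 # sampled uncertainty
        target = theta[0] + theta[1] * w          # policy proposes a target
        _, lam = solve_inner(target, ideal_level(w))  # forward pass
        # Backward pass: d(value)/d(theta) = lam * d(target)/d(theta).
        theta[0] -= lr * lam
        theta[1] -= lr * lam * w
    return theta


theta = train()
print("learned decision rule parameters:", theta)
```

In this toy setting the learned parameters should approach the coefficients of `ideal_level`, since that target makes the slack penalty vanish for every sample. A realistic multi-stage problem would replace `solve_inner` with a call to an optimization solver that reports constraint duals.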
