Min Max Generalization for Two-stage Deterministic Batch Mode Reinforcement Learning: Relaxation Schemes
This work addresses computational efficiency in batch mode reinforcement learning for deterministic settings, offering incremental improvements over existing methods.
The authors tackled the NP-hard minmax optimization problem for two-stage deterministic batch mode reinforcement learning by proposing two relaxation schemes, one dropping constraints for polynomial-time solvability and another using Lagrangian relaxation to form a conic quadratic program, both theoretically and empirically outperforming prior results.
We study the minmax optimization problem introduced in [22] for computing policies for batch mode reinforcement learning in a deterministic setting. First, we show that this problem is NP-hard. In the two-stage case, we provide two relaxation schemes. The first relaxation scheme works by dropping some constraints in order to obtain a problem that is solvable in polynomial time. The second relaxation scheme, based on a Lagrangian relaxation where all constraints are dualized, leads to a conic quadratic programming problem. We also theoretically prove and empirically illustrate that both relaxation schemes provide better results than those given in [22].