Performance Improvement Bounds for Lipschitz Configurable Markov Decision Processes
This work provides theoretical bounds for Conf-MDPs, which model environments with configurable parameters, but the contribution is incremental, since it extends existing results.
The paper addresses how to bound the performance improvement in configurable Markov decision processes (Conf-MDPs) under Lipschitz continuity, and derives a novel lower bound on that improvement.
Configurable Markov Decision Processes (Conf-MDPs) have recently been introduced as an extension of traditional Markov Decision Processes (MDPs) to model real-world scenarios in which it is possible to intervene in the environment and configure some of its parameters. In this paper, we focus on a particular subclass of Conf-MDPs that satisfies regularity conditions, namely Lipschitz continuity. We start by providing a bound on the Wasserstein distance between the $\gamma$-discounted stationary distributions induced by a change of policy and configuration. This result generalizes existing bounds for both Conf-MDPs and traditional MDPs. Then, we derive a novel lower bound on the performance improvement.
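To make the quantity in the first result concrete, the Python sketch below computes it numerically for a toy problem rather than bounding it. It uses the standard definition $d^{\pi,p}_\mu(s) = (1-\gamma)\sum_{t \ge 0} \gamma^t \Pr(s_t = s \mid s_0 \sim \mu, \pi, p)$, and it assumes a finite state space placed on the real line with ground metric $|x - y|$, where $W_1$ reduces to the integral of the gap between the two CDFs. The environment (`chain_model`, a three-state chain whose move-success probability plays the role of the configuration) and all helper names are hypothetical illustrations, not the paper's construction or code.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def chain_model(success: float, n_states: int = 3) -> List[Matrix]:
    """Hypothetical configurable environment (illustration only): states
    0..n-1 on a line. Action 0 tries to move left, action 1 tries to move
    right; a move succeeds with probability `success` (the configuration
    parameter), otherwise the agent stays. Returns P[a][s][s']."""
    model = []
    for step in (-1, +1):
        rows = []
        for s in range(n_states):
            row = [0.0] * n_states
            target = min(max(s + step, 0), n_states - 1)  # clip at the ends
            row[target] += success
            row[s] += 1.0 - success
            rows.append(row)
        model.append(rows)
    return model


def state_kernel(model: List[Matrix], policy: Matrix) -> Matrix:
    """State-to-state kernel P(s'|s) = sum_a pi(a|s) * p(s'|s, a)."""
    n = len(policy)
    return [
        [sum(policy[s][a] * model[a][s][t] for a in range(len(model))) for t in range(n)]
        for s in range(n)
    ]


def discounted_stationary(mu: Vector, kernel: Matrix, gamma: float,
                          tol: float = 1e-12) -> Vector:
    """d(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s), truncating the series
    once the next weight (1 - gamma) * gamma^t falls below `tol`."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    n = len(mu)
    d = [0.0] * n
    marginal = list(mu)       # distribution of s_t, starting at t = 0
    weight = 1.0 - gamma      # (1 - gamma) * gamma^t
    while weight > tol:
        for s in range(n):
            d[s] += weight * marginal[s]
        marginal = [sum(marginal[s] * kernel[s][t] for s in range(n)) for t in range(n)]
        weight *= gamma
    return d


def wasserstein_1(p: Vector, q: Vector, positions: Vector) -> float:
    """W1 between two distributions on sorted points of the real line,
    computed as the integral of |F_p - F_q| between consecutive points."""
    total, cdf_gap = 0.0, 0.0
    for i in range(len(positions) - 1):
        cdf_gap += p[i] - q[i]
        total += abs(cdf_gap) * (positions[i + 1] - positions[i])
    return total


# Usage: compare (uniform policy, success 0.8) with (go-right policy, success 0.9).
positions = [0.0, 1.0, 2.0]
mu = [1.0, 0.0, 0.0]
gamma = 0.9
uniform = [[0.5, 0.5] for _ in range(3)]
go_right = [[0.0, 1.0] for _ in range(3)]
d_old = discounted_stationary(mu, state_kernel(chain_model(0.8), uniform), gamma)
d_new = discounted_stationary(mu, state_kernel(chain_model(0.9), go_right), gamma)
print(f"W1(d_old, d_new) = {wasserstein_1(d_old, d_new, positions):.4f}")
```

Truncating the discounted series keeps the sketch dependency-free; with a linear-algebra library one would instead solve $d^\top (I - \gamma P) = (1-\gamma)\mu^\top$ directly. Because the example changes both the policy and the configuration at once, the printed distance is the kind of quantity the paper's first result bounds.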