Contextual Markov Decision Processes
This work addresses the problem of optimizing interactions in scenarios like website-customer dynamics, where customer characteristics are hidden, but it is incremental as it focuses on a basic finite-horizon scenario with a small known number of contexts.
The paper tackles the problem of planning in environments where dynamics and rewards depend on a hidden static parameter (context), aiming to maximize accumulated reward across all contexts. It introduces the Contextual Markov Decision Process (CMDP) model and proposes algorithms with provable guarantees for learning models and latent contexts, with bounds provided for specific implementations.
We consider a planning problem where the dynamics and rewards of the environment depend on a hidden static parameter referred to as the context. The objective is to learn a strategy that maximizes the accumulated reward across all contexts. The new model, called Contextual Markov Decision Process (CMDP), can model a customer's behavior when interacting with a website (the learner). The customer's behavior depends on gender, age, location, device, etc. Based on that behavior, the website objective is to determine customer characteristics, and to optimize the interaction between them. Our work focuses on one basic scenario--finite horizon with a small known number of possible contexts. We suggest a family of algorithms with provable guarantees that learn the underlying models and the latent contexts, and optimize the CMDPs. Bounds are obtained for specific naive implementations, and extensions of the framework are discussed, laying the ground for future research.