Minimizing the Outage Probability in a Markov Decision Process
This work addresses the need for risk-aware decision-making in reinforcement learning, though it appears incremental as an extension of existing methods.
The authors tackled the problem of optimizing policies in Markov decision processes for the probability of exceeding a given gain threshold, rather than the expected gain, and developed an algorithm extending value iteration with potential for neural network generalization.
Standard Markov decision process (MDP) and reinforcement learning algorithms optimize the policy with respect to the expected gain. We propose an algorithm that optimizes an alternative objective: the probability that the gain is greater than a given value. The algorithm can be seen as an extension of the value iteration algorithm. We also show how the proposed algorithm could be generalized to use neural networks, similarly to the deep Q-learning extension of Q-learning.
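To make the objective concrete, the following is a minimal sketch of one way a value-iteration-style recursion can target the probability of reaching a gain threshold. It is not taken from the paper and may differ from the authors' algorithm. It assumes a finite horizon, undiscounted integer rewards, and success defined as total gain at least equal to the threshold (a strict inequality would only change the terminal test). The toy MDP `TRANSITIONS`, the function `success_probability`, and the action names `safe` and `risky` are hypothetical. The idea is to augment the state with the gain still required, so that the value of a pair (state, remaining gain) is the best achievable probability of success.

```python
from functools import lru_cache

# Hypothetical toy MDP (illustrative, not from the paper):
# transitions[state][action] = list of (probability, next_state, reward)
TRANSITIONS = {
    0: {
        "safe": [(1.0, 0, 1)],                # always yields reward 1
        "risky": [(0.5, 0, 3), (0.5, 0, 0)],  # reward 3 or 0, equally likely
    },
}


def success_probability(transitions, horizon, state, threshold):
    """Return (best probability that total gain >= threshold, best first action).

    Backward recursion over the augmented state (state, remaining gain):
    the terminal value is 1 if the remaining gain is <= 0, else 0, and each
    step maximizes the expected value of the successor augmented states.
    """

    @lru_cache(maxsize=None)
    def value(s, remaining, steps_left):
        if steps_left == 0:
            # Assumption: success means gain >= threshold at the horizon.
            return (1.0 if remaining <= 0 else 0.0), None
        best_p, best_a = -1.0, None
        for action, outcomes in transitions[s].items():
            p = sum(
                prob * value(s2, remaining - r, steps_left - 1)[0]
                for prob, s2, r in outcomes
            )
            if p > best_p:  # ties keep the first action encountered
                best_p, best_a = p, action
        return best_p, best_a

    return value(state, threshold, horizon)


if __name__ == "__main__":
    print(success_probability(TRANSITIONS, horizon=1, state=0, threshold=1))
    print(success_probability(TRANSITIONS, horizon=1, state=0, threshold=2))
    print(success_probability(TRANSITIONS, horizon=2, state=0, threshold=3))
```

In this toy problem the `risky` action has the higher expected reward (1.5 versus 1), yet for a threshold of 1 over a single step the sketch selects `safe`, which succeeds with probability 1. This illustrates how the threshold objective can lead to a different policy than the expected-gain objective.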