Solving Multi-Objective MDP with Lexicographic Preference: An application to stochastic planning with multiple quantile objective
This work addresses risk-sensitive, multi-objective decision-making in stochastic planning. The contribution is incremental: it extends existing MDP frameworks with quantile-based evaluation and lexicographic preferences.
The paper studies how to evaluate policies in Markov Decision Processes (MDPs) when quantiles are used for risk aversion and several objectives must be balanced, such as speed and safety in autonomous driving. It reformulates the problem as a multi-objective MDP with lexicographic preference and proposes the FLMDP algorithm to compute optimal policies.
In most common settings of a Markov Decision Process (MDP), an agent evaluates a policy by the expectation of the (discounted) sum of rewards. In many applications, however, this criterion may be unsuitable for two reasons. First, in risk-averse settings the expectation of accumulated rewards is not robust enough, in particular when the distribution of the accumulated reward is heavily skewed. Second, many applications naturally take several objectives into account when evaluating a policy; for instance, in autonomous driving an agent needs to balance speed and safety when choosing a decision. In this paper, we consider evaluating a policy based on the sequence of quantiles it induces on a set of target states. Our idea is to reformulate the original problem as a multi-objective MDP with a naturally defined lexicographic preference. To compute an optimal policy, we propose an algorithm, \textbf{FLMDP}, that can solve general multi-objective MDPs with lexicographic reward preference.
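The two motivating points, that an expectation can hide a heavy tail and that several objectives can be ranked lexicographically, can be illustrated with a small sketch. It is not the FLMDP algorithm and does not reproduce the paper's formulation over target states. The two outcome distributions, the quantile levels 0.05 and 0.5, and the helper names `mean`, `lower_quantile`, and `lex_better` are all assumptions made up for this example.

```python
from typing import Sequence


def mean(outcomes: Sequence[float], probs: Sequence[float]) -> float:
    """Expected value of a finite discrete distribution."""
    return sum(x * p for x, p in zip(outcomes, probs))


def lower_quantile(outcomes: Sequence[float], probs: Sequence[float], tau: float) -> float:
    """Smallest outcome x such that P(X <= x) >= tau."""
    pairs = sorted(zip(outcomes, probs))
    cumulative = 0.0
    for x, p in pairs:
        cumulative += p
        if cumulative >= tau - 1e-12:  # small slack for floating-point sums
            return x
    return pairs[-1][0]


def lex_better(u: Sequence[float], v: Sequence[float], tol: float = 1e-9) -> bool:
    """True if u is strictly preferred to v lexicographically (higher is better).

    Objectives are compared in priority order; a later objective matters
    only when all earlier ones are tied within `tol`.
    """
    for a, b in zip(u, v):
        if a > b + tol:
            return True
        if a < b - tol:
            return False
    return False


# Hypothetical accumulated-reward distributions of two policies (assumed data).
risky = ([-100.0, 20.0], [0.05, 0.95])  # rare catastrophic outcome
safe = ([4.0, 6.0], [0.5, 0.5])

# Objective vectors: (0.05-quantile, 0.5-quantile), highest priority first.
risky_vec = tuple(lower_quantile(*risky, tau) for tau in (0.05, 0.5))
safe_vec = tuple(lower_quantile(*safe, tau) for tau in (0.05, 0.5))

print("means:", mean(*risky), mean(*safe))
print("quantile vectors:", risky_vec, safe_vec)
print("safe preferred lexicographically:", lex_better(safe_vec, risky_vec))
```

In this toy setting the risky policy has the higher expectation, while the safe policy wins on the first-priority quantile and is therefore preferred under the lexicographic order, regardless of the second objective.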