LG · AI · May 17, 2025

Beyond Scalar Rewards: An Axiomatic Framework for Lexicographic MDPs

arXiv:2505.12049v1
h-index: 24
Originality: Highly original
AI Analysis

This work addresses a foundational issue in reinforcement learning for scenarios where multi-objective preferences cannot be captured by scalar rewards, offering a theoretical framework with practical implications.

The paper tackles the problem of representing preferences in Markov Decision Processes (MDPs) when scalar rewards are insufficient. It identifies a condition under which a 2-dimensional reward function is required and characterizes such reward functions. It also shows that optimal policies in this setting retain desirable properties of the scalar-reward case, whereas optimal policies in Constrained MDPs do not.

Recent work has formalized the reward hypothesis through the lens of expected utility theory, by interpreting reward as utility. Hausner's foundational work showed that dropping the continuity axiom leads to a generalization of expected utility theory where utilities are lexicographically ordered vectors of arbitrary dimension. In this paper, we extend this result by identifying a simple and practical condition under which preferences cannot be represented by scalar rewards, necessitating a 2-dimensional reward function. We provide a full characterization of such reward functions, as well as the general d-dimensional case, in Markov Decision Processes (MDPs) under a memorylessness assumption on preferences. Furthermore, we show that optimal policies in this setting retain many desirable properties of their scalar-reward counterparts, while in the Constrained MDP (CMDP) setting -- another common multiobjective setting -- they do not.
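To make the idea of lexicographically ordered 2-dimensional rewards concrete, the following is a minimal sketch, not the paper's construction. It assumes each policy is summarized by a hypothetical expected return vector (primary, secondary), compares those vectors lexicographically, and contrasts the result with a weighted-sum scalarization. The policy names, the return values, and the helper names `lex_better`, `lex_argmax`, and `scalarize` are all illustrative assumptions.

```python
from typing import Dict, Tuple

# A 2-dimensional expected return: (primary objective, secondary objective).
Return2D = Tuple[float, float]


def lex_better(a: Return2D, b: Return2D) -> bool:
    """Return True if `a` is strictly preferred to `b` lexicographically.

    The primary component decides; the secondary component only breaks ties.
    """
    if a[0] != b[0]:
        return a[0] > b[0]
    return a[1] > b[1]


def lex_argmax(returns: Dict[str, Return2D]) -> str:
    """Pick the policy whose return vector is lexicographically largest.

    Python tuples already compare lexicographically, so max() suffices.
    """
    return max(returns, key=lambda name: returns[name])


def scalarize(r: Return2D, weight: float) -> float:
    """Weighted-sum scalarization with a finite weight on the primary objective."""
    return weight * r[0] + r[1]


# Hypothetical expected returns for three policies (illustrative numbers only).
policy_returns: Dict[str, Return2D] = {
    "safe_slow": (1.0, 2.0),
    "safe_fast": (1.0, 5.0),
    "risky_fast": (0.9, 100.0),
}

# Lexicographic choice: maximize the primary objective, then the secondary.
best = lex_argmax(policy_returns)

# Scalarized choice with an assumed finite weight of 10 on the primary objective.
weight = 10.0
scalar_best = max(
    policy_returns, key=lambda name: scalarize(policy_returns[name], weight)
)

print("lexicographic choice:", best)
print("scalarized choice:", scalar_best)
```

With these numbers the lexicographic rule selects `safe_fast`, while the weighted sum selects `risky_fast`, because a large enough secondary return outweighs the small loss in the primary one. Raising the weight fixes this particular instance, but for any fixed finite weight a sufficiently large secondary gap produces the same disagreement, which gives some intuition for why such preferences may not admit a scalar representation.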

