LG | Feb 9, 2024

Value function interference and greedy action selection in value-based multi-objective reinforcement learning

arXiv:2402.06266v1 | 4 citations | h-index: 27
Originality: Synthesis-oriented
AI Analysis

This paper addresses a specific technical challenge in MORL that is relevant to researchers in the area, but it is incremental: it builds on existing methods without introducing a new paradigm.

The paper tackles the problem of value function interference in multi-objective reinforcement learning, which arises when a utility function maps widely varying vector values to similar utility levels and can cause convergence to sub-optimal policies. It shows that avoiding random tie-breaking in greedy action selection partially mitigates the issue.

Multi-objective reinforcement learning (MORL) algorithms extend conventional reinforcement learning (RL) to the more general case of problems with multiple, conflicting objectives, represented by vector-valued rewards. Widely used scalar RL methods such as Q-learning can be modified to handle multiple objectives by (1) learning vector-valued value functions, and (2) performing action selection using a scalarisation or ordering operator which reflects the user's utility with respect to the different objectives. However, as we demonstrate here, if the user's utility function maps widely varying vector values to similar levels of utility, this can lead to interference in the value function learned by the agent, causing convergence to sub-optimal policies. This will be most prevalent in stochastic environments when optimising for the Expected Scalarised Return criterion, but we present a simple example showing that interference can also arise in deterministic environments. We demonstrate empirically that avoiding the use of random tie-breaking when identifying greedy actions can ameliorate, but not fully overcome, the problems caused by value function interference.
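The two modifications named in the abstract, vector-valued value functions and utility-based greedy action selection, can be sketched roughly as follows. This is an illustrative sketch, not the paper's implementation: the names `min_utility`, `greedy_action`, and `q_update` are hypothetical, the min-based utility is an arbitrary stand-in for a user's utility function, and bootstrapping from the greedy action's whole vector is an assumption about how the vector-valued update is done.

```python
import numpy as np


def min_utility(v):
    """Hypothetical utility: the worst-performing objective (an assumption)."""
    return float(np.min(v))


def greedy_action(q_vectors, utility, tie_break="first", rng=None):
    """Return the action whose vector-valued Q estimate has the highest utility.

    q_vectors has shape (n_actions, n_objectives).
    tie_break="first" picks the lowest-index tied action (deterministic);
    tie_break="random" picks uniformly among the tied actions.
    """
    utilities = np.array([utility(q) for q in q_vectors])
    tied = np.flatnonzero(np.isclose(utilities, utilities.max(), rtol=0.0, atol=1e-9))
    if tie_break == "random":
        if rng is None:
            rng = np.random.default_rng()
        return int(rng.choice(tied))
    return int(tied[0])


def q_update(Q, s, a, reward, s_next, utility,
             alpha=0.1, gamma=1.0, tie_break="first", rng=None):
    """One vector-valued Q-learning step; Q has shape (states, actions, objectives).

    Assumption: the target bootstraps from the full vector of the action
    that is greedy under the utility function in the next state.
    """
    a_next = greedy_action(Q[s_next], utility, tie_break, rng)
    target = reward + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return a_next


# Two actions with very different vectors but identical utility under min().
tied_q = np.array([[10.0, 1.0], [1.0, 10.0]])
print(greedy_action(tied_q, min_utility))  # deterministic choice among ties
```

In the tied case above, random tie-breaking may bootstrap from `[10, 1]` on one update and `[1, 10]` on another, so the predecessor's estimate could drift towards a blend of the two vectors that matches neither policy. That is one plausible reading of the interference the abstract describes, offered here as an illustration rather than as the paper's own analysis.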

