LG · AI · Aug 26, 2020

Inverse Policy Evaluation for Value-based Sequential Decision-making

arXiv:2008.11329v1 · 1 citation
Originality: Incremental advance
AI Analysis

The paper addresses a fundamental issue in value-based reinforcement learning, namely how to derive behavior from a value function. The advance is incremental because it builds on existing methods.

The paper tackles the problem of deriving behavior from value functions in reinforcement learning, especially when value iteration yields functions that do not correspond to any policy. It proposes inverse policy evaluation, which solves for a likely policy given a value function, and presents theoretical and empirical results indicating that the approach is feasible.

Value-based methods for reinforcement learning lack generally applicable ways to derive behavior from a value function. Many approaches involve approximate value iteration (e.g., $Q$-learning), and acting greedily with respect to the estimates with an arbitrary degree of entropy to ensure that the state-space is sufficiently explored. Behavior based on explicit greedification assumes that the values reflect those of \textit{some} policy, over which the greedy policy will be an improvement. However, value iteration can produce value functions that do not correspond to \textit{any} policy. This is especially relevant in the function-approximation regime, when the true value function cannot be perfectly represented. In this work, we explore the use of \textit{inverse policy evaluation}, the process of solving for a likely policy given a value function, for deriving behavior from a value function. We provide theoretical and empirical results to show that inverse policy evaluation, combined with an approximate value iteration algorithm, is a feasible method for value-based control.
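
To make the idea of solving for a policy given a value function concrete, here is a minimal tabular sketch in Python. It is not the paper's algorithm, which this page does not spell out. It assumes a small MDP with known transition probabilities P and expected rewards R, and all function and variable names are hypothetical. For each state it picks action probabilities so that the one-step Bellman backup under the policy matches the given value as closely as possible, mixing only the lowest-valued and highest-valued actions (one of many possible solutions).

```python
import numpy as np


def policy_evaluation(pi, P, R, gamma):
    """Exact state values of policy pi in a known tabular MDP.

    pi: (S, A) action probabilities, P: (S, A, S) transitions,
    R: (S, A) expected rewards, gamma: discount factor in [0, 1).
    """
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transitions under pi
    r_pi = (pi * R).sum(axis=1)             # expected one-step reward under pi
    n_states = P.shape[0]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)


def inverse_policy_evaluation(v, P, R, gamma):
    """Find a policy whose one-step Bellman backup of v is as close to v as possible.

    Illustrative assumption: each state mixes only its lowest- and
    highest-valued actions. Returns (pi, residual), where residual is the
    per-state Bellman error sum_a pi(a|s) q(s, a) - v(s).
    """
    q = R + gamma * (P @ v)                 # (S, A) one-step lookahead values under v
    rows = np.arange(q.shape[0])
    lo, hi = q.argmin(axis=1), q.argmax(axis=1)
    q_lo, q_hi = q[rows, lo], q[rows, hi]
    gap = q_hi - q_lo
    safe_gap = np.where(gap > 0, gap, 1.0)  # avoid division by zero when all actions tie
    # Weight w on the best action solves w * q_hi + (1 - w) * q_lo = v when
    # v lies in [q_lo, q_hi]; otherwise it is clipped to the nearest endpoint.
    w = np.clip(np.where(gap > 0, (v - q_lo) / safe_gap, 1.0), 0.0, 1.0)
    pi = np.zeros_like(q)
    pi[rows, lo] += 1.0 - w
    pi[rows, hi] += w
    residual = (pi * q).sum(axis=1) - v
    return pi, residual


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, gamma = 4, 3, 0.9
    P = rng.random((S, A, S))
    P /= P.sum(axis=2, keepdims=True)       # normalize rows into probabilities
    R = rng.random((S, A))

    # Values of the uniform policy are realizable, so the residual is ~0.
    v = policy_evaluation(np.full((S, A), 1.0 / A), P, R, gamma)
    pi, residual = inverse_policy_evaluation(v, P, R, gamma)
    print("max |residual| for realizable v:", np.abs(residual).max())

    # Values far above anything attainable match no policy; the residual stays nonzero.
    _, bad_residual = inverse_policy_evaluation(np.full(S, 100.0), P, R, gamma)
    print("residual for unrealizable v:", bad_residual)
```

When the residual is zero in every state, the given values are the fixed point of policy evaluation for the recovered policy, so they are exactly that policy's values. A nonzero residual signals the situation the abstract describes: the value function corresponds to no policy, and the sketch returns the closest match under its own restricted mixing rule.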
