LGDec 28, 2020

Blackwell Online Learning for Markov Decision Processes

arXiv:2012.14043v119 citations
AI Analysis

This work provides a new theoretical lens for understanding and solving MDPs, which could benefit researchers in reinforcement learning and online optimization.

This paper reinterprets Markov Decision Processes (MDPs) as an online optimization problem, where the policy is the decision variable and the value function is the payoff feedback. This framework leads to Blackwell value iteration for offline planning and Blackwell Q-learning for online learning, both of which are proven to converge to the optimal solution.

This work provides a novel interpretation of Markov Decision Processes (MDP) from the online optimization viewpoint. In such an online optimization context, the policy of the MDP is viewed as the decision variable while the corresponding value function is treated as payoff feedback from the environment. Based on this interpretation, we construct a Blackwell game induced by MDP, which bridges the gap among regret minimization, Blackwell approachability theory, and learning theory for MDP. Specifically, from the approachability theory, we propose 1) Blackwell value iteration for offline planning and 2) Blackwell $Q-$learning for online learning in MDP, both of which are shown to converge to the optimal solution. Our theoretical guarantees are corroborated by numerical experiments.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes