LG AI · Feb 6, 2022

Approximate Policy Iteration with Bisimulation Metrics

arXiv:2202.02881v38.712 citations · Has Code

Originality: Incremental advance

AI Analysis

This work provides incremental improvements to policy iteration methods for reinforcement learning, with potential applications in actor-critic algorithms and state representation learning.

The authors improve approximate policy iteration (API) in Markov decision processes by combining a bisimulation-metric-based state discretization with conservative policy updates. This yields better theoretical performance bounds than naive API, which they validate empirically on finite MDPs.

Bisimulation metrics define a distance measure between states of a Markov decision process (MDP) based on a comparison of reward sequences. Because of this property, they provide theoretical guarantees in value function approximation (VFA). In this work we first prove that bisimulation and $\pi$-bisimulation metrics can be defined via a more general class of Sinkhorn distances, which unifies various state similarity metrics used in recent work. Then we describe an approximate policy iteration (API) procedure that uses a bisimulation-based discretization of the state space for VFA and prove asymptotic performance bounds. Next, we bound the difference between $\pi$-bisimulation metrics in terms of the change in the policies themselves. Based on these results, we design an API($\alpha$) procedure that employs conservative policy updates and enjoys better performance bounds than the naive API approach. We discuss how such API procedures map onto practical actor-critic methods that use bisimulation metrics for state representation learning. Lastly, we validate our theoretical results and investigate their practical implications via a controlled empirical analysis based on an implementation of bisimulation-based API for finite MDPs.
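As a rough illustration of the ingredients named in the abstract, the sketch below computes a Sinkhorn-based $\pi$-bisimulation metric on a tiny finite MDP by fixed-point iteration, and then applies a conservative policy update. It is not the authors' implementation. The recursion $d(s,t) = |r^\pi(s) - r^\pi(t)| + \gamma \, \mathrm{OT}_d(P^\pi(\cdot|s), P^\pi(\cdot|t))$, the relative scaling of the entropic regularizer, the mixture form of the update, and all function and parameter names (`sinkhorn_cost`, `pi_bisim_metric`, `conservative_update`, `eps`, `alpha`) are assumptions made for this example.

```python
import numpy as np

def sinkhorn_cost(p, q, cost, eps=0.1, n_iters=200):
    """Transport cost <T, cost> of the entropy-regularized (Sinkhorn) plan T."""
    # Assumption: regularization strength is scaled by the largest cost entry.
    reg = eps * max(float(cost.max()), 1e-12)
    K = np.exp(-cost / reg)
    u = np.ones_like(p)
    for _ in range(n_iters):
        v = q / (K.T @ u)  # scale columns to match marginal q
        u = p / (K @ v)    # scale rows to match marginal p
    T = u[:, None] * K * v[None, :]
    return float(np.sum(T * cost))

def pi_bisim_metric(P, R, pi, gamma=0.9, eps=0.1, n_sweeps=50):
    """Fixed-point iteration for a Sinkhorn-based pi-bisimulation metric.

    P: transitions, shape (S, A, S); R: rewards, shape (S, A);
    pi: stochastic policy, shape (S, A).
    """
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state kernel under pi
    r_pi = np.sum(pi * R, axis=1)          # expected one-step reward under pi
    n = len(r_pi)
    d = np.zeros((n, n))
    for _ in range(n_sweeps):
        new_d = np.zeros_like(d)           # simplification: d(s, s) is held at 0
        for s in range(n):
            for t in range(s + 1, n):
                w = sinkhorn_cost(P_pi[s], P_pi[t], d, eps)
                new_d[s, t] = new_d[t, s] = abs(r_pi[s] - r_pi[t]) + gamma * w
        d = new_d
    return d

def conservative_update(pi, Q, alpha=0.1):
    """Move pi a step of size alpha toward the greedy policy for Q (assumed form)."""
    greedy = np.zeros_like(pi)
    greedy[np.arange(pi.shape[0]), np.argmax(Q, axis=1)] = 1.0
    return (1.0 - alpha) * pi + alpha * greedy

# Toy MDP: states 0 and 1 both earn reward 1 and move to absorbing state 2.
P = np.zeros((3, 1, 3))
P[:, 0, 2] = 1.0
R = np.array([[1.0], [1.0], [0.0]])
pi = np.ones((3, 1))
print(np.round(pi_bisim_metric(P, R, pi), 3))
```

States that the metric places close together could then be merged into one cell of a discretization used for VFA. A small `alpha` keeps consecutive policies close, which is the regime in which a bound on the change of the $\pi$-bisimulation metric in terms of the change in policies is useful.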

