LGOCMLJul 22, 2020

Approximation Benefits of Policy Gradient Methods with Aggregated States

arXiv:2007.11684v30.007 citations
AI Analysis55

This provides theoretical justification for the robustness of policy gradient methods in reinforcement learning with state aggregation, which is incremental but clarifies a known bottleneck.

The paper tackles the problem of policy gradient methods being more robust to approximation errors in state-aggregated representations compared to approximate policy iteration and value iteration, showing that policy gradient converges with per-period regret bounded by ε, while the others scale as ε/(1-γ).

Folklore suggests that policy gradient can be more robust to misspecification than its relative, approximate policy iteration. This paper studies the case of state-aggregated representations, where the state space is partitioned and either the policy or value function approximation is held constant over partitions. This paper shows a policy gradient method converges to a policy whose regret per-period is bounded by $ε$, the largest difference between two elements of the state-action value function belonging to a common partition. With the same representation, both approximate policy iteration and approximate value iteration can produce policies whose per-period regret scales as $ε/(1-γ)$, where $γ$ is a discount factor. Faced with inherent approximation error, methods that locally optimize the true decision-objective can be far more robust.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes