AI · Nov 10, 2017

Learning with Options that Terminate Off-Policy

Anna Harutyunyan, Peter Vrancx, Pierre-Luc Bacon, Doina Precup, Ann Nowe

arXiv:1711.03817v217.728 citations

Originality: Incremental advance

AI Analysis

This paper addresses a fundamental dilemma in option-based reinforcement learning: it offers a method that improves flexibility and performance without requiring an ideal option set. The contribution appears incremental, since it builds on existing off-policy learning frameworks.

The paper tackles the trade-off between learning efficiency and solution quality in reinforcement learning with options by proposing a new algorithm, Q(β), that decouples behavior and target terminations, enabling learning with respect to any termination condition regardless of actual option termination.

A temporally abstract action, or an option, is specified by a policy and a termination condition: the policy guides option behavior, and the termination condition roughly determines its length. Generally, learning with longer options (like learning with multi-step returns) is known to be more efficient. However, if the option set for the task is not ideal, and cannot express the primitive optimal policy exactly, shorter options offer more flexibility and can yield a better solution. Thus, the termination condition puts learning efficiency at odds with solution quality. We propose to resolve this dilemma by decoupling the behavior and target terminations, just like it is done with policies in off-policy learning. To this end, we give a new algorithm, Q(β), that learns the solution with respect to any termination condition, regardless of how the options actually terminate. We derive Q(β) by casting learning with options into a common framework with well-studied multi-step off-policy learning. We validate our algorithm empirically, and show that it holds up to its motivating claims.
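The decoupling described above can be illustrated with a small sketch. It is not the paper's Q(β) update, which this summary does not spell out. It assumes the standard one-step intra-option value target and a tabular value table, and only shows how a target termination probability can enter the update independently of how the option actually terminated. The names `option_target`, `update`, and `beta_target` are hypothetical.

```python
def option_target(q, s_next, option, reward, beta_target, gamma=0.99):
    """One-step target for the option currently executing.

    q is a table indexed as q[state][option].
    beta_target is the probability with which the *target* termination
    condition would stop the option in s_next. It is an assumption of this
    sketch that it is supplied separately from the behavior termination.
    """
    continue_value = q[s_next][option]   # value if the option keeps running
    switch_value = max(q[s_next])        # value if it terminates and the best option is re-picked
    next_value = (1.0 - beta_target) * continue_value + beta_target * switch_value
    return reward + gamma * next_value


def update(q, s, option, reward, s_next, beta_target, alpha=0.1, gamma=0.99):
    """Move q[s][option] toward the target; returns the TD error."""
    td_error = option_target(q, s_next, option, reward, beta_target, gamma) - q[s][option]
    q[s][option] += alpha * td_error
    return td_error


# Two states, two options; state 1 prefers option 1.
q = [[0.0, 0.0], [1.0, 3.0]]
# The behavior may keep following option 0, yet the target can evaluate
# "always terminate" (beta_target=1.0) or "never terminate" (beta_target=0.0).
long_option_target = option_target(q, s_next=1, option=0, reward=0.0, beta_target=0.0, gamma=1.0)
short_option_target = option_target(q, s_next=1, option=0, reward=0.0, beta_target=1.0, gamma=1.0)
```

With `beta_target=0.0` the target bootstraps only from the running option, and with `beta_target=1.0` it bootstraps from the greedy choice over options, which is the flexibility of short options that the abstract refers to.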
