AI, LG
Jun 25, 2020

SOAC: The Soft Option Actor-Critic Architecture

arXiv:2006.14363v1
8 citations
Originality: Incremental advance
AI Analysis

This paper addresses a key problem in reinforcement learning for long-horizon tasks by improving exploration and the stability of option learning. The advance is incremental, since it builds on existing option frameworks.

The paper tackles two challenges in learning temporally-extended sub-tasks (options) in reinforcement learning: ineffective exploration and unstable updates. It introduces an off-policy approach based on the maximum entropy model, together with an information-theoretical intrinsic reward. The authors report that the approach significantly outperforms prior methods on MuJoCo benchmark tasks and learns diverse, coherent options.

The option framework has shown great promise by automatically extracting temporally-extended sub-tasks from a long-horizon task. Methods have been proposed for concurrently learning low-level intra-option policies and a high-level option selection policy. However, existing methods typically suffer from two major challenges: ineffective exploration and unstable updates. In this paper, we present a novel and stable off-policy approach that builds on the maximum entropy model to address these challenges. Our approach introduces an information-theoretical intrinsic reward for encouraging the identification of diverse and effective options. Meanwhile, we utilize a probability inference model to simplify the optimization problem as fitting optimal trajectories. Experimental results demonstrate that our approach significantly outperforms prior on-policy and off-policy methods in a range of MuJoCo benchmark tasks while still providing benefits for transfer learning. In these tasks, our approach learns a diverse set of options, each of whose state-action space has strong coherence.
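
To make the two ingredients named in the abstract more concrete, the sketch below illustrates (a) entropy-regularized option selection as a Boltzmann distribution over option values and (b) a discriminator-based diversity bonus of the form log q(o | s, a) - log p(o). This is not the paper's implementation: the function names, the temperature parameter, the uniform option prior, and the specific form of the bonus are all assumptions chosen for illustration, and the abstract does not specify the actual intrinsic reward.

```python
import math
import random


def softmax(values, temperature=1.0):
    """Boltzmann distribution over values; a higher temperature gives more entropy."""
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_option(option_values, temperature, rng):
    """Sample an option index from the softmax over (hypothetical) option values."""
    probs = softmax(option_values, temperature)
    option = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return option, probs


def diversity_bonus(discriminator_probs, option, num_options):
    """Assumed intrinsic reward: log q(o | s, a) - log p(o), with p(o) uniform.

    discriminator_probs is a hypothetical classifier output q(. | s, a) that
    guesses which option produced the state-action pair.
    """
    return math.log(discriminator_probs[option]) + math.log(num_options)


if __name__ == "__main__":
    rng = random.Random(0)
    values = [1.0, 2.0, 0.5]  # made-up option values for a single state
    option, probs = select_option(values, temperature=1.0, rng=rng)
    print("sampled option:", option)
    print("selection entropy:", entropy(probs))
    print("bonus:", diversity_bonus([0.7, 0.1, 0.1, 0.1], 0, 4))
```

Under these assumptions, raising the temperature spreads probability across options, which encourages exploration at the high level. The bonus is zero when the discriminator cannot tell options apart and positive when an option visits state-action pairs that identify it, which is one common way to reward diverse options.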
