LG, Jun 21, 2024

A Provably Efficient Option-Based Algorithm for both High-Level and Low-Level Learning

Gianluca Drappo, Alberto Maria Metelli, Marcello Restelli

arXiv:2406.15124v2, 2.6

Originality: Incremental advance

AI Analysis

This work addresses a theoretical gap in hierarchical reinforcement learning. The advance is incremental, since it builds on prior option-based methods.

The paper tackles the lack of theoretical understanding of hierarchical reinforcement learning when the high-level and low-level policies are both learned, rather than the options being fixed in advance. It presents a meta-algorithm that alternates between regret minimization at the two levels of temporal abstraction, and it derives regret bounds that show when a hierarchical approach is provably preferable, even without pre-trained options.

Hierarchical Reinforcement Learning (HRL) approaches have shown successful results in solving a large variety of complex, structured, long-horizon problems. Nevertheless, a full theoretical understanding of this empirical evidence is currently missing. In the context of the option framework, prior research has devised efficient algorithms for scenarios where options are fixed and only the high-level policy selecting among options has to be learned. However, the fully realistic scenario in which both the high-level and the low-level policies are learned is surprisingly disregarded from a theoretical perspective. This work makes a step towards the understanding of this latter scenario. Focusing on the finite-horizon problem, we present a meta-algorithm alternating between regret minimization algorithms instanced at different (high and low) temporal abstractions. At the higher level, we treat the problem as a Semi-Markov Decision Process (SMDP) with fixed low-level policies, while at the lower level, inner option policies are learned with a fixed high-level policy. The derived bounds are compared with the lower bound for non-hierarchical finite-horizon problems, allowing us to characterize when a hierarchical approach is provably preferable, even without pre-trained options.
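The alternation described in the abstract can be pictured with a small Python sketch. This is not the paper's algorithm: the fixed-length phases, the `phase_length` parameter, and the `CountingLearner` stand-in (which only counts updates instead of minimizing regret) are all assumptions made for illustration. The sketch shows only the control flow, in which one level is updated while the other stays fixed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CountingLearner:
    """Hypothetical stand-in for a regret minimizer; it only counts updates."""
    name: str
    updates: int = 0

    def update(self, episode: int) -> None:
        # A real learner would collect a trajectory and refine its policy here.
        self.updates += 1


def alternate(high: CountingLearner, low: CountingLearner,
              num_episodes: int, phase_length: int) -> List[str]:
    """Run episodes in phases, updating exactly one level per phase.

    Returns, for each episode, the name of the level that was learning.
    The fixed phase length is an assumption, not taken from the paper.
    """
    if phase_length <= 0:
        raise ValueError("phase_length must be positive")
    trace = []
    for episode in range(num_episodes):
        phase = episode // phase_length
        # Even phases: learn the high-level (SMDP) policy, options fixed.
        # Odd phases: learn the option policies, high-level policy fixed.
        learner = high if phase % 2 == 0 else low
        learner.update(episode)
        trace.append(learner.name)
    return trace
```

Calling `alternate(CountingLearner("high"), CountingLearner("low"), 10, 3)` gives three high-level episodes, three low-level episodes, three high-level episodes, and one final low-level episode.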
