LG · Nov 1, 2024

Direct Preference Optimization for Primitive-Enabled Hierarchical Reinforcement Learning

arXiv:2411.00361v3 · 2 citations · h-index: 29
Originality: Highly original
AI Analysis

This paper addresses two fundamental issues in HRL for complex robotic tasks, non-stationarity and infeasible subgoals, and represents a strong incremental advance.

The paper tackles the challenges of non-stationarity and infeasible subgoals in hierarchical reinforcement learning by introducing DIPPER, a framework that uses direct preference optimization to train the higher-level policy, achieving up to 40% improvement over state-of-the-art baselines in sparse reward scenarios.

Hierarchical reinforcement learning (HRL) enables agents to solve complex, long-horizon tasks by decomposing them into manageable sub-tasks. However, HRL methods often suffer from two fundamental challenges: (i) non-stationarity, caused by the changing behavior of the lower-level policy during training, which destabilizes higher-level policy learning, and (ii) the generation of infeasible subgoals that lower-level policies cannot achieve. In this work, we introduce DIPPER, a novel HRL framework that formulates hierarchical policy learning as a bi-level optimization problem and leverages direct preference optimization (DPO) to train the higher-level policy using preference feedback. By optimizing the higher-level policy with DPO, we decouple higher-level learning from the non-stationary lower-level reward signal, thus mitigating non-stationarity. To further address the infeasible subgoal problem, DIPPER incorporates a regularization term that encourages the higher-level policy to propose subgoals the lower-level policy can actually achieve. Extensive experiments on challenging robotic navigation and manipulation benchmarks demonstrate that DIPPER achieves up to 40% improvement over state-of-the-art baselines in sparse reward scenarios, highlighting its effectiveness in overcoming longstanding limitations of HRL.
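To make the DPO step concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It is not DIPPER's exact objective: the abstract does not give the formula, and the feasibility regularization is not modeled here. The function names, the `beta` default, and the idea of passing summed log-probabilities of subgoal sequences are assumptions made for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    logp_preferred: float,
    logp_rejected: float,
    ref_logp_preferred: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed temperature; not taken from the paper
) -> float:
    """Standard DPO loss for one preference pair.

    Each argument is the (summed) log-probability that a policy assigns to a
    subgoal sequence: `logp_*` under the policy being trained, `ref_logp_*`
    under a fixed reference policy (hypothetical setup).
    """
    # Implicit reward margin between the preferred and rejected sequences.
    margin = beta * (
        (logp_preferred - ref_logp_preferred)
        - (logp_rejected - ref_logp_rejected)
    )
    # Negative log-likelihood of the observed preference (Bradley-Terry model).
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # The loss is lower when the trained policy favors the preferred sequence.
    agrees = dpo_loss(-1.0, -3.0, -2.0, -2.0)
    disagrees = dpo_loss(-3.0, -1.0, -2.0, -2.0)
    print(agrees < disagrees)
```

Because the loss depends only on preference labels and policy log-probabilities, it never reads the lower-level reward, which is the decoupling the abstract describes.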
