LGNov 1, 2024

Direct Preference Optimization for Primitive-Enabled Hierarchical Reinforcement Learning

Utsav Singh, Souradip Chakraborty, Wesley A. Suttle, Brian M. Sadler, Derrik E. Asher, Anit Kumar Sahu, Mubarak Shah, Vinay P. Namboodiri, Amrit Singh Bedi

arXiv:2411.00361v36.42 citationsh-index: 29

Originality Highly original

AI Analysis

This addresses fundamental issues in HRL for complex robotic tasks, representing a strong incremental advance.

The paper tackles the challenges of non-stationarity and infeasible subgoals in hierarchical reinforcement learning by introducing DIPPER, a framework that uses direct preference optimization to train the higher-level policy, achieving up to 40% improvement over state-of-the-art baselines in sparse reward scenarios.

Hierarchical reinforcement learning (HRL) enables agents to solve complex, long-horizon tasks by decomposing them into manageable sub-tasks. However, HRL methods often suffer from two fundamental challenges: (i) non-stationarity, caused by the changing behavior of the lower-level policy during training, which destabilizes higher-level policy learning, and (ii) the generation of infeasible subgoals that lower-level policies cannot achieve. In this work, we introduce DIPPER, a novel HRL framework that formulates hierarchical policy learning as a bi-level optimization problem and leverages direct preference optimization (DPO) to train the higher-level policy using preference feedback. By optimizing the higher-level policy with DPO, we decouple higher-level learning from the non-stationary lower-level reward signal, thus mitigating non-stationarity. To further address the infeasible subgoal problem, DIPPER incorporates a regularization that tries to ensure the feasibility of subgoal tasks within the capabilities of the lower-level policy. Extensive experiments on challenging robotic navigation and manipulation benchmarks demonstrate that DIPPER achieves up to 40\% improvement over state-of-the-art baselines in sparse reward scenarios, highlighting its effectiveness in overcoming longstanding limitations of HRL.

View on arXiv PDF

Similar