Imitating Graph-Based Planning with Goal-Conditioned Policies
This work addresses inefficient learning in long-horizon goal-conditioned RL for robotics and control applications. It is an incremental improvement over existing graph-based planning methods.
The paper tackles the sample-efficiency challenge in goal-conditioned reinforcement learning for long-horizon tasks. It introduces a self-imitation scheme that distills subgoal-conditioned policies into a target-goal-conditioned policy, together with stochastic subgoal skipping, and reports significant empirical gains in sample-efficiency across various control tasks.
Recently, graph-based planning algorithms have gained much attention for solving goal-conditioned reinforcement learning (RL) tasks: they provide a sequence of subgoals to reach the target-goal, and the agents learn to execute subgoal-conditioned policies. However, the sample-efficiency of such RL schemes remains a challenge, particularly for long-horizon tasks. To address this issue, we present a simple yet effective self-imitation scheme that distills a subgoal-conditioned policy into the target-goal-conditioned policy. Our intuition is that, to reach a target-goal, an agent should pass through a subgoal, so the target-goal-conditioned and subgoal-conditioned policies should be similar to each other. We also propose a novel scheme of stochastically skipping executed subgoals in a planned path, which further improves performance. Unlike prior methods that utilize graph-based planning only in the execution phase, our method transfers knowledge from a planner, along with a graph, into policy learning. We empirically show that our method can significantly boost the sample-efficiency of existing goal-conditioned RL methods on various long-horizon control tasks.
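The two ideas in the abstract, self-imitation between target-goal- and subgoal-conditioned policies and stochastic subgoal skipping, can be illustrated with a minimal sketch. The abstract does not give the exact loss or skipping rule, so everything here is an assumption made for illustration: the linear `policy`, the squared-error form of `self_imitation_loss`, the per-step skipping probability `skip_prob`, and all function names are hypothetical rather than the authors' implementation.

```python
import numpy as np


def choose_subgoal_index(path_length, skip_prob, rng):
    """Pick which subgoal in a planned path to condition on.

    Assumed rule: starting from the first subgoal, jump to the next one
    with probability `skip_prob`, repeatedly, never going past the final
    entry of the path (the target goal).
    """
    idx = 0
    while idx < path_length - 1 and rng.random() < skip_prob:
        idx += 1
    return idx


def policy(weights, state, goal):
    """Hypothetical deterministic goal-conditioned policy (linear map)."""
    return weights @ np.concatenate([state, goal])


def self_imitation_loss(weights, state, target_goal, subgoal):
    """Squared distance between the two goal-conditioned actions.

    Assumed form: the subgoal-conditioned action is treated as a fixed
    imitation target for the target-goal-conditioned action.
    """
    action_for_target = policy(weights, state, target_goal)
    action_for_subgoal = policy(weights, state, subgoal)
    return float(np.sum((action_for_target - action_for_subgoal) ** 2))


# Toy usage: a three-step planned path whose last entry is the target goal.
rng = np.random.default_rng(0)
planned_path = [np.array([1.0, 0.0]), np.array([2.0, 1.0]), np.array([3.0, 3.0])]
state = np.zeros(2)
weights = rng.normal(size=(2, 4))  # maps [state; goal] to a 2-D action

idx = choose_subgoal_index(len(planned_path), skip_prob=0.3, rng=rng)
loss = self_imitation_loss(weights, state, planned_path[-1], planned_path[idx])
print(idx, loss)
```

In this sketch the loss is zero whenever the chosen subgoal is the target goal itself, which matches the intuition that the two policies should agree. In a real training loop the subgoal-conditioned action would be held fixed (no gradient) while the target-goal-conditioned policy is updated.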