RO · AI · CV · LG · May 17, 2022

Planning to Practice: Efficient Online Fine-Tuning by Composing Goals in Latent Space

Stanford
arXiv:2205.08129v2 · 42 citations · h-index: 166
AI Analysis

This paper addresses the time-consuming training of general-purpose robots for unstructured environments. The contribution appears incremental, since it builds on existing goal-conditioned and hierarchical methods.

The paper tackles the challenge of training goal-conditioned reinforcement learning policies for long-horizon tasks. It proposes Planning to Practice (PTP), which decomposes tasks hierarchically: a high-level planner sets subgoals in a latent space for a low-level policy. Both the subgoal generator and the policy are pre-trained offline, and the policy is then fine-tuned online. The reported result is efficient task solving in simulation and real-world experiments.

General-purpose robots require diverse repertoires of behaviors to complete challenging tasks in real-world unstructured environments. To address this issue, goal-conditioned reinforcement learning aims to acquire policies that can reach configurable goals for a wide range of tasks on command. However, such goal-conditioned policies are notoriously difficult and time-consuming to train from scratch. In this paper, we propose Planning to Practice (PTP), a method that makes it practical to train goal-conditioned policies for long-horizon tasks that require multiple distinct types of interactions to solve. Our approach is based on two key ideas. First, we decompose the goal-reaching problem hierarchically, with a high-level planner that sets intermediate subgoals using conditional subgoal generators in the latent space for a low-level model-free policy. Second, we propose a hybrid approach which first pre-trains both the conditional subgoal generator and the policy on previously collected data through offline reinforcement learning, and then fine-tunes the policy via online exploration. This fine-tuning process is itself facilitated by the planned subgoals, which break down the original target task into short-horizon goal-reaching tasks that are significantly easier to learn. We conduct experiments in both simulation and the real world, in which the policy is pre-trained on demonstrations of short primitive behaviors and fine-tuned for temporally extended tasks that are unseen in the offline data. Our experimental results show that PTP can generate feasible sequences of subgoals that enable the policy to efficiently solve the target tasks.
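
To make the first idea concrete, here is a minimal sketch of planning a subgoal sequence by chaining a conditional subgoal generator. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the planning procedure, so plain random shooting is swapped in, the learned generator is replaced by a hypothetical toy function (`toy_subgoal_generator`), and latent states are ordinary tuples of floats. All function and parameter names are invented for this example.

```python
import math
import random


def toy_subgoal_generator(state, noise, step=1.0):
    """Hypothetical stand-in for a learned conditional subgoal generator.

    Maps the current latent state and a noise vector to a nearby latent
    subgoal. In PTP this role is played by a model pre-trained on offline data.
    """
    return tuple(s + step * n for s, n in zip(state, noise))


def plan_subgoals(start, goal, horizon=4, num_candidates=256, seed=0):
    """Random-shooting planner (an assumption, not the paper's planner).

    Samples candidate subgoal sequences by applying the generator
    recursively, then keeps the sequence whose last subgoal is closest
    to the goal in latent space.
    """
    rng = random.Random(seed)
    best_plan, best_cost = None, float("inf")
    for _ in range(num_candidates):
        state, plan = start, []
        for _ in range(horizon):
            noise = tuple(rng.uniform(-1.0, 1.0) for _ in start)
            state = toy_subgoal_generator(state, noise)
            plan.append(state)
        cost = math.dist(plan[-1], goal)
        if cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan, best_cost


if __name__ == "__main__":
    plan, cost = plan_subgoals(start=(0.0, 0.0), goal=(2.0, 1.0))
    for i, subgoal in enumerate(plan, start=1):
        print(f"subgoal {i}: {subgoal}")
    print(f"distance from last subgoal to goal: {cost:.3f}")
```

In this sketch each planned subgoal would be handed in turn to the low-level goal-conditioned policy, so that online fine-tuning practices short-horizon goal-reaching tasks instead of the full long-horizon task.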

