LG AI RO | May 20, 2025

Flattening Hierarchies with Policy Bootstrapping

arXiv:2505.14975v2 | 11.43 citations | h-index: 2

Originality: Highly original

AI Analysis

This paper addresses the high complexity and poor scalability of hierarchical RL on long-horizon goal-reaching tasks in robotics and control, offering a simpler flat-policy alternative.

The paper tackles the challenge of scaling offline goal-conditioned reinforcement learning to long-horizon tasks by introducing a flat policy training algorithm that bootstraps on subgoal-conditioned policies, eliminating the need for generative models over goal spaces. It matches or surpasses state-of-the-art methods on locomotion and manipulation benchmarks, scaling to complex tasks where prior approaches fail.
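To make the idea of advantage weighting concrete, here is a minimal Python sketch. It is not the paper's objective: it only illustrates the generic pattern of weighting candidate subgoals, drawn from dataset trajectories rather than from a generative model, by their exponentiated advantage, then using those weights in a log-likelihood loss for the flat policy. The function names, the `temperature` parameter, and the example numbers are all hypothetical.

```python
import math


def advantage_weights(advantages, temperature=1.0):
    """Self-normalized exponential advantage weights (hypothetical helper).

    Each candidate subgoal gets a weight proportional to
    exp(temperature * advantage); the weights sum to 1.
    """
    scaled = [temperature * a for a in advantages]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_bootstrap_loss(log_probs, advantages, temperature=1.0):
    """Advantage-weighted negative log-likelihood (illustrative only).

    log_probs[i] is the flat policy's log-probability of the action
    suggested by the policy conditioned on candidate subgoal i.
    """
    weights = advantage_weights(advantages, temperature)
    return -sum(w * lp for w, lp in zip(weights, log_probs))


# Assumed toy inputs: three candidate subgoals taken from dataset trajectories.
example_advantages = [0.0, 1.0, 2.0]
example_log_probs = [-1.0, -2.0, -0.5]
example_weights = advantage_weights(example_advantages)
example_loss = weighted_bootstrap_loss(example_log_probs, example_advantages)
```

In this sketch, subgoals with higher estimated advantage dominate the loss, so no separate model is needed to generate subgoals; whether and how the paper normalizes or clips such weights is not stated in this summary.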

Offline goal-conditioned reinforcement learning (GCRL) is a promising approach for pretraining generalist policies on large datasets of reward-free trajectories, akin to the self-supervised objectives used to train foundation models for computer vision and natural language processing. However, scaling GCRL to longer horizons remains challenging due to the combination of sparse rewards and discounting, which obscures the comparative advantages of primitive actions with respect to distant goals. Hierarchical RL methods achieve strong empirical results on long-horizon goal-reaching tasks, but their reliance on modular, timescale-specific policies and subgoal generation introduces significant additional complexity and hinders scaling to high-dimensional goal spaces. In this work, we introduce an algorithm to train a flat (non-hierarchical) goal-conditioned policy by bootstrapping on subgoal-conditioned policies with advantage-weighted importance sampling. Our approach eliminates the need for a generative model over the (sub)goal space, which we find is key for scaling to high-dimensional control in large state spaces. We further show that existing hierarchical and bootstrapping-based approaches correspond to specific design choices within our derivation. Across a comprehensive suite of state- and pixel-based locomotion and manipulation benchmarks, our method matches or surpasses state-of-the-art offline GCRL algorithms and scales to complex, long-horizon tasks where prior approaches fail. Project page: https://johnlyzhou.github.io/saw/
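The abstract's point about sparse rewards and discounting can be illustrated with a small calculation. The sketch below assumes a reward of 1 on reaching the goal and 0 elsewhere, so the optimal value of a state `d` steps from the goal is `gamma ** d`; this convention, the function names, and the `wasted_steps` parameter are assumptions made for illustration, not details from the paper.

```python
def goal_value(steps_to_goal, gamma=0.99):
    """Optimal value under the assumed sparse reward: gamma ** steps."""
    return gamma ** steps_to_goal


def action_gap(steps_to_goal, gamma=0.99, wasted_steps=2):
    """Value gap between the best action and one that wastes a few steps.

    Equals gamma ** d * (1 - gamma ** wasted_steps), which shrinks
    geometrically as the goal distance d grows.
    """
    best = goal_value(steps_to_goal, gamma)
    worse = goal_value(steps_to_goal + wasted_steps, gamma)
    return best - worse


# Print the gap at increasing goal distances.
for distance in (10, 100, 1000):
    print(distance, action_gap(distance))
```

The gap between a good and a slightly worse primitive action shrinks geometrically with goal distance, so for distant goals it can fall below the error of a learned value function. Conditioning on a nearer subgoal keeps `d` small and the gap large.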

