Generalizable Hierarchical Skill Learning via Object-Centric Representation
This work addresses sample efficiency and generalization in robot manipulation. It offers a novel hierarchical framework, though its integration of existing foundation models is incremental.
The paper introduces Generalizable Hierarchical Skill Learning (GSL) to improve policy generalization and sample efficiency in robot manipulation. GSL uses object-centric skills to bridge high-level vision-language models and low-level policies. In simulation, GSL trained with only 3 demonstrations per task outperforms baselines trained with 30 times more data by 15.5 percent on unseen tasks.
We present Generalizable Hierarchical Skill Learning (GSL), a novel framework for hierarchical policy learning that significantly improves policy generalization and sample efficiency in robot manipulation. One core idea of GSL is to use object-centric skills as an interface that bridges the high-level vision-language model and the low-level visual-motor policy. Specifically, GSL decomposes demonstrations into transferable and object-canonicalized skill primitives using foundation models, ensuring efficient low-level skill learning in the object frame. At test time, the skill-object pairs predicted by the high-level agent are fed to the low-level module, where the inferred canonical actions are mapped back to the world frame for execution. This structured yet flexible design leads to substantial improvements in sample efficiency and generalization of our method across unseen spatial arrangements, object appearances, and task compositions. In simulation, GSL trained with only 3 demonstrations per task outperforms baselines trained with 30 times more data by 15.5 percent on unseen tasks. In real-world experiments, GSL also surpasses the baseline trained with 10 times more data.
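The object-frame idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that object and action poses are rigid transforms written as 4x4 homogeneous matrices, that canonicalization means left-multiplying by the inverse object pose, and that poses are planar rotations about the z axis for simplicity. All function names (`pose_from_yaw`, `canonicalize`, `to_world`) are hypothetical and chosen for illustration only.

```python
import math


def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t):
    """Invert a rigid transform [R p; 0 1] as [R^T, -R^T p; 0 1]."""
    rt = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(rt[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [rt[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def pose_from_yaw(x, y, z, yaw):
    """Build a pose at (x, y, z) rotated by `yaw` radians about the z axis."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, x],
            [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


def canonicalize(world_T_obj, world_T_action):
    """Express a world-frame action pose in the object frame (assumed form)."""
    return matmul(invert_rigid(world_T_obj), world_T_action)


def to_world(world_T_obj, obj_T_action):
    """Map an object-frame action pose back to the world frame."""
    return matmul(world_T_obj, obj_T_action)


# Demonstration: a grasp 5 cm along the object's x axis and 10 cm above it.
demo_obj = pose_from_yaw(0.5, 0.2, 0.0, 0.0)
demo_grasp = pose_from_yaw(0.55, 0.2, 0.1, 0.0)
canonical_grasp = canonicalize(demo_obj, demo_grasp)

# Test time: the object has moved and is rotated by 90 degrees.
test_obj = pose_from_yaw(-0.3, 0.4, 0.0, math.pi / 2)
world_grasp = to_world(test_obj, canonical_grasp)

# Position of the grasp in the world frame for the new arrangement.
grasp_xyz = [world_grasp[i][3] for i in range(3)]
print(grasp_xyz)
```

Under these assumptions, the canonical grasp stays the same wherever the object is placed, so a low-level policy only has to learn it once; the object pose alone determines where the action lands in the world frame. This is one way to read the abstract's claim about generalization across unseen spatial arrangements.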