RO · LG · Sep 29, 2025

From Code to Action: Hierarchical Learning of Diffusion-VLM Policies

arXiv:2509.24917v1 · 1 citation · h-index: 3 · Has Code
Originality: Incremental advance
AI Analysis

This work targets robotic manipulation, particularly complex, long-horizon tasks, by using open-source robotic APIs as a source of structured supervision. It appears incremental because it builds on existing methods such as VLMs and diffusion policies.

The paper tackles limited generalization and data scarcity in imitation learning for robotic manipulation. It introduces a hierarchical framework that combines code-generating vision-language models with low-level diffusion policies, and the authors report improved generalization and interpretable policy decomposition compared to flat policies.

Imitation learning for robotic manipulation often suffers from limited generalization and data scarcity, especially in complex, long-horizon tasks. In this work, we introduce a hierarchical framework that leverages code-generating vision-language models (VLMs) in combination with low-level diffusion policies to effectively imitate and generalize robotic behavior. Our key insight is to treat open-source robotic APIs not only as execution interfaces but also as sources of structured supervision: the associated subtask functions - when exposed - can serve as modular, semantically meaningful labels. We train a VLM to decompose task descriptions into executable subroutines, which are then grounded through a diffusion policy trained to imitate the corresponding robot behavior. To handle the non-Markovian nature of both code execution and certain real-world tasks, such as object swapping, our architecture incorporates a memory mechanism that maintains subtask context across time. We find that this design enables interpretable policy decomposition, improves generalization when compared to flat policies and enables separate evaluation of high-level planning and low-level control.
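The hierarchy described in the abstract can be sketched as a small control loop. This is only an illustrative sketch, not the authors' implementation: `plan_subtasks` is a hard-coded stand-in for the code-generating VLM, `low_level_policy` is a stand-in for the diffusion policy, and every name here (`Subtask`, `SubtaskMemory`, `pick`, `place`, `buffer`, `slot_of_...`) is a hypothetical choice made for this example. It shows why memory matters for object swapping: the subroutine `pick(cube_a)` appears twice in the plan, and only the stored subtask context tells the two occurrences apart.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Subtask:
    name: str    # hypothetical API function name, e.g. "pick"
    target: str  # hypothetical argument, e.g. "cube_a"


@dataclass
class SubtaskMemory:
    """Subtask context carried across time steps (assumed structure)."""
    completed: List[Subtask] = field(default_factory=list)
    current: Optional[Subtask] = None
    steps_in_current: int = 0

    def start(self, subtask: Subtask) -> None:
        self.current = subtask
        self.steps_in_current = 0

    def finish(self) -> None:
        self.completed.append(self.current)
        self.current = None


def plan_subtasks(task: str) -> List[Subtask]:
    """Stand-in for the VLM planner: only understands 'swap A and B'."""
    words = task.split()
    if len(words) == 4 and words[0] == "swap" and words[2] == "and":
        a, b = words[1], words[3]
        return [
            Subtask("pick", a), Subtask("place", "buffer"),
            Subtask("pick", b), Subtask("place", f"slot_of_{a}"),
            Subtask("pick", a), Subtask("place", f"slot_of_{b}"),
        ]
    raise ValueError(f"unsupported task: {task!r}")


def low_level_policy(observation: str, memory: SubtaskMemory) -> str:
    """Stand-in for the diffusion policy: returns a labeled dummy action.

    The stub ignores the observation; a learned policy would condition on it.
    """
    st = memory.current
    return (f"{st.name}({st.target}) "
            f"step={memory.steps_in_current} done={len(memory.completed)}")


def run_episode(task: str, steps_per_subtask: int = 2) -> List[str]:
    """Run every planned subtask for a fixed number of low-level steps."""
    memory = SubtaskMemory()
    log: List[str] = []
    for subtask in plan_subtasks(task):
        memory.start(subtask)
        for _ in range(steps_per_subtask):
            observation = "camera_frame"  # placeholder observation
            log.append(low_level_policy(observation, memory))
            memory.steps_in_current += 1
        memory.finish()
    return log


if __name__ == "__main__":
    for action in run_episode("swap cube_a and cube_b", steps_per_subtask=1):
        print(action)
```

Because the planner and the low-level policy are separate functions, each can be checked on its own (the plan against expected subroutines, the policy against a fixed subtask), which mirrors the separate evaluation of high-level planning and low-level control mentioned in the abstract.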

