RO, LG | Mar 6, 2025

Refined Policy Distillation: From VLA Generalists to RL Experts

Tobias Jülg, Wolfram Burgard, Florian Walter

arXiv:2503.05833v2 | 18.016 citations | h-index: 8 | Has Code | IROS

Originality: Incremental advance

AI Analysis

This paper addresses the problem of improving VLA performance in robotic manipulation. It appears incremental, as it builds on existing distillation and RL methods.

The paper tackles the performance gap between Vision-Language-Action Models (VLAs) and expert policies by introducing Refined Policy Distillation (RPD), which distills VLAs into compact expert policies using RL and behavioral cloning, resulting in policies that outperform VLAs in manipulation tasks and achieve faster convergence than RL baselines.

Vision-Language-Action Models (VLAs) have demonstrated remarkable generalization capabilities in real-world experiments. However, their success rates are often not on par with expert policies, and they require fine-tuning when the setup changes. In this work, we introduce Refined Policy Distillation (RPD), a novel Reinforcement Learning (RL)-based policy refinement method that bridges this performance gap by combining on-policy RL with behavioral cloning. The core idea of RPD is to distill and refine VLAs into compact, high-performing expert policies by guiding the student policy during RL exploration using the actions of a teacher VLA, resulting in increased sample efficiency and faster convergence. We complement our method with fine-tuned versions of Octo and OpenVLA for ManiSkill3 to evaluate RPD in simulation. While evaluation in simulation is a key requirement for applying RL, it also yields new insights beyond existing studies on VLA performance in real-world settings. Our experimental results across various manipulation tasks show that RPD enables the RL student to learn expert policies that outperform the VLA teacher in both dense and sparse reward settings, while also achieving faster convergence than the RL baseline. Our approach is also robust to changes in camera perspective and can generalize to task variations that the underlying VLA cannot solve. Our code, dataset, VLA checkpoints, and videos are available at https://refined-policy-distillation.github.io
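
The abstract describes RPD as on-policy RL combined with behavioral cloning toward the actions of a teacher VLA. One way such a combined objective could look is sketched below, assuming a PPO-style clipped surrogate for the RL part and a mean-squared-error imitation term. The function names, the `bc_weight` coefficient, and the choice of MSE are illustrative assumptions, not details given in the abstract.

```python
import numpy as np


def ppo_clip_loss(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate, returned as a loss to minimize."""
    ratio = np.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))


def bc_loss(student_actions, teacher_actions):
    """Imitation term: mean squared error to the teacher's actions (assumed)."""
    return np.mean((student_actions - teacher_actions) ** 2)


def distillation_loss(log_prob_new, log_prob_old, advantages,
                      student_actions, teacher_actions, bc_weight=1.0):
    """RL loss plus a weighted imitation term (hypothetical combination)."""
    rl_term = ppo_clip_loss(log_prob_new, log_prob_old, advantages)
    bc_term = bc_loss(student_actions, teacher_actions)
    return rl_term + bc_weight * bc_term
```

In this sketch the imitation term pulls the student toward the teacher on the states the student itself visits, while the RL term remains free to improve on the teacher where the reward signal disagrees. Setting `bc_weight=0.0` recovers the plain PPO loss.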
