LG · AI · RO · Apr 15

Jump-Start Reinforcement Learning with Vision-Language-Action Regularization

arXiv:2604.13733 · 13.0 · h-index: 25
Predicted impact: top 48% in LG · last 90 days · Originality: Incremental advance
AI Analysis

For robotic manipulation tasks with sparse rewards, VLAJS offers a practical way to leverage pretrained VLA models to accelerate RL without requiring demonstrations or continuous teacher queries.

The paper proposes VLAJS, a method that uses vision-language-action models to provide sparse action guidance for on-policy reinforcement learning, improving exploration and credit assignment in robotic manipulation. VLAJS reduces required environment interactions by over 50% in several tasks and achieves zero-shot sim-to-real transfer.

Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sparse or imperfect rewards remains difficult due to inefficient exploration and poor credit assignment. Vision-Language-Action (VLA) models leverage large-scale multimodal pretraining to provide generalist, task-level reasoning, but current limitations hinder their direct use in fast and precise manipulation. In this paper, we propose Vision-Language-Action Jump-Starting (VLAJS), a method that bridges sparse VLA guidance with on-policy RL to improve exploration and learning efficiency. VLAJS treats VLAs as transient sources of high-level action suggestions that bias early exploration and improve credit assignment, while preserving the high-frequency, state-based control of RL. Our approach augments Proximal Policy Optimization (PPO) with a directional action-consistency regularization that softly aligns the RL agent's actions with VLA guidance during early training, without enforcing strict imitation, requiring demonstrations, or relying on continuous teacher queries. VLA guidance is applied sparsely and annealed over time, allowing the agent to adapt online and ultimately surpass the guiding policy. We evaluate VLAJS on six challenging manipulation tasks: lifting, pick-and-place, peg reorientation, peg insertion, poking, and pushing in simulation, and validate a subset on a real Franka Panda robot. VLAJS consistently outperforms PPO and distillation-style baselines in sample efficiency, reducing required environment interactions by over 50% in several tasks. Real-world experiments demonstrate zero-shot sim-to-real transfer and robust execution under clutter, object variation, and external perturbations.
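
The abstract describes the regularizer only in words, so the following is a minimal sketch of one way a sparse, annealed directional action-consistency term could be added to a PPO loss. Everything specific here is an assumption made for illustration and is not taken from the paper: the cosine-based penalty, the linear annealing schedule, the use of `None` to mark steps where the VLA was not queried, and the names `anneal_weight`, `directional_penalty`, and `regularized_loss`.

```python
import math


def anneal_weight(step, total_anneal_steps, initial_weight=1.0):
    """Linearly decay the guidance weight from initial_weight to 0 (assumed schedule)."""
    if total_anneal_steps <= 0:
        return 0.0
    frac = min(max(step / total_anneal_steps, 0.0), 1.0)
    return initial_weight * (1.0 - frac)


def directional_penalty(agent_action, vla_action, eps=1e-8):
    """Return 1 - cosine similarity: 0 for matching directions, 2 for opposite ones.

    Only the direction is compared, so the agent is free to choose its own
    action magnitude (an assumed reading of "directional" consistency).
    """
    dot = sum(a * v for a, v in zip(agent_action, vla_action))
    norm_a = math.sqrt(sum(a * a for a in agent_action))
    norm_v = math.sqrt(sum(v * v for v in vla_action))
    if norm_a < eps or norm_v < eps:
        return 0.0  # a zero vector has no direction, so apply no penalty
    return 1.0 - dot / (norm_a * norm_v)


def regularized_loss(ppo_loss, agent_actions, vla_actions, step,
                     total_anneal_steps, initial_weight=1.0):
    """Add the annealed penalty to a precomputed PPO loss.

    vla_actions holds None wherever the VLA was not queried, which models
    sparse guidance; those steps contribute nothing to the penalty.
    """
    penalties = [
        directional_penalty(a, v)
        for a, v in zip(agent_actions, vla_actions)
        if v is not None
    ]
    if not penalties:
        return ppo_loss
    weight = anneal_weight(step, total_anneal_steps, initial_weight)
    return ppo_loss + weight * sum(penalties) / len(penalties)
```

In this sketch the penalty vanishes once `step` reaches `total_anneal_steps`, after which the objective is the unmodified PPO loss. That mirrors the abstract's point that the guidance is transient and the agent can eventually surpass the guiding policy. In a real training loop the penalty would be computed on differentiable policy outputs inside an autodiff framework rather than on plain Python lists.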
