LG | RO | Feb 13

Dual-Granularity Contrastive Reward via Generated Episodic Guidance for Efficient Embodied RL

arXiv:2602.12636v1 | h-index: 5
Originality: Incremental advance
AI Analysis

The paper addresses sample inefficiency in embodied RL for robotics with a method that reduces reliance on expert supervision. The advance is incremental, since the method builds on existing video generation models.

The paper tackles the challenge of designing dense rewards for efficient reinforcement learning in embodied manipulation. It proposes a framework that generates episodic guidance from a few expert videos and scores the agent with dual-granularity contrastive rewards. Without human annotations, the framework achieves improved sample efficiency and stable policy convergence across 18 diverse tasks.

Designing suitable rewards poses a significant challenge in reinforcement learning (RL), especially for embodied manipulation. Trajectory success rewards are suitable for human judges or model fitting, but their sparsity severely limits RL sample efficiency. While recent methods have effectively improved RL via dense rewards, they rely heavily on high-quality human-annotated data or abundant expert supervision. To tackle these issues, this paper proposes Dual-granularity contrastive reward via generated Episodic Guidance (DEG), a novel framework for obtaining sample-efficient dense rewards without requiring human annotations or extensive supervision. Leveraging the prior knowledge of large video generation models, DEG needs only a small number of expert videos for domain adaptation to generate dedicated task guidance for each RL episode. The proposed dual-granularity reward, which balances coarse-grained exploration and fine-grained matching, then guides the agent to efficiently approximate the generated guidance video sequentially in the contrastive self-supervised latent space and finally complete the target task. Extensive experiments on 18 diverse tasks across both simulation and real-world settings show that DEG can not only serve as an efficient exploration stimulus that helps the agent quickly discover sparse success rewards, but also guide effective RL and stable policy convergence independently.
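To make the idea of a reward with a coarse term and a fine term concrete, here is a minimal Python sketch. It is not the paper's formulation: the abstract does not give the reward equations, so the function `dual_granularity_reward`, the mixing weight `alpha`, the `advance_threshold`, and the use of plain cosine similarity on ready-made embedding vectors are all assumptions chosen for illustration. In the setting the abstract describes, the embeddings would come from a contrastive self-supervised encoder applied to the agent's observation and to the frames of the generated guidance video.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dual_granularity_reward(
    obs: Vector,
    guidance: Sequence[Vector],
    target_idx: int,
    alpha: float = 0.5,              # hypothetical coarse/fine mixing weight
    advance_threshold: float = 0.9,  # hypothetical "frame reached" threshold
) -> Tuple[float, int]:
    """Illustrative reward: returns (reward, next_target_idx).

    Assumption, not the paper's method: the coarse term is the normalized
    position of the best-matching guidance frame, and the fine term is the
    similarity to the current target frame.
    """
    if not guidance:
        raise ValueError("guidance must contain at least one frame embedding")

    sims = [cosine(obs, g) for g in guidance]

    # Coarse term: how far along the guidance video the closest frame lies.
    best_idx = max(range(len(guidance)), key=lambda i: sims[i])
    coarse = best_idx / max(len(guidance) - 1, 1)

    # Fine term: how closely the observation matches the current target frame.
    fine = sims[target_idx]

    reward = alpha * coarse + (1.0 - alpha) * fine

    # Sequential matching: move to the next frame once the target is reached.
    next_idx = target_idx
    if fine >= advance_threshold and target_idx < len(guidance) - 1:
        next_idx += 1
    return reward, next_idx
```

In an RL loop under these assumptions, the caller would carry `next_target_idx` from step to step within an episode and reset it to 0 when a new guidance video is generated for the next episode.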

