RO · CV · Dec 9, 2024

CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction

arXiv:2412.06782v3 · 36 citations · h-index: 18
Originality: Highly original
AI Analysis

The paper addresses slow and constrained action generation in robotics and offers a more efficient and flexible paradigm, though it appears incremental because it builds on existing autoregressive and diffusion methods.

The paper tackles the inefficiency and limited flexibility of diffusion-based models in robotic visuomotor policy learning by introducing CARP, a coarse-to-fine autoregressive approach that achieves up to a 10% improvement in success rates and 10x faster inference compared to state-of-the-art policies.

In robotic visuomotor policy learning, diffusion-based models have achieved significant success in improving the accuracy of action trajectory generation compared to traditional autoregressive models. However, they suffer from inefficiency due to multiple denoising steps and limited flexibility from complex constraints. In this paper, we introduce Coarse-to-Fine AutoRegressive Policy (CARP), a novel paradigm for visuomotor policy learning that redefines the autoregressive action generation process as a coarse-to-fine, next-scale approach. CARP decouples action generation into two stages: first, an action autoencoder learns multi-scale representations of the entire action sequence; then, a GPT-style transformer refines the sequence prediction through a coarse-to-fine autoregressive process. This straightforward and intuitive approach produces highly accurate and smooth actions, matching or even surpassing the performance of diffusion-based policies while maintaining efficiency on par with autoregressive policies. We conduct extensive evaluations across diverse settings, including single-task and multi-task scenarios on state-based and image-based simulation benchmarks, as well as real-world tasks. CARP achieves competitive success rates, with up to a 10% improvement, and delivers 10x faster inference compared to state-of-the-art policies, establishing a high-performance, efficient, and flexible paradigm for action generation in robotic tasks.
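The coarse-to-fine, next-scale idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes fixed average pooling and linear interpolation in place of CARP's learned action autoencoder, and it omits the GPT-style transformer entirely. The function names (`downsample`, `upsample`, `encode_multiscale`, `decode_multiscale`) and the scale schedule `[1, 2, 4, 8, 16]` are hypothetical, chosen only to show how an action sequence can be split into per-scale maps and rebuilt one scale at a time.

```python
import numpy as np

def downsample(seq, length):
    """Average-pool a (T, D) sequence down to (length, D)."""
    # Split the time axis into `length` nearly equal chunks and average each.
    chunks = np.array_split(np.arange(seq.shape[0]), length)
    return np.stack([seq[idx].mean(axis=0) for idx in chunks])

def upsample(seq, length):
    """Linearly interpolate a (L, D) sequence up to (length, D)."""
    L, D = seq.shape
    if L == 1:
        # A single coarse token is broadcast over the whole horizon.
        return np.repeat(seq, length, axis=0)
    src = np.linspace(0.0, 1.0, L)
    dst = np.linspace(0.0, 1.0, length)
    return np.stack([np.interp(dst, src, seq[:, d]) for d in range(D)], axis=1)

def encode_multiscale(actions, scales):
    """Decompose a (T, D) action sequence into residual maps, coarse to fine.

    Assumption: fixed pooling stands in for a learned multi-scale encoder.
    """
    T = actions.shape[0]
    residual = actions.copy()
    maps = []
    for s in scales:
        coarse = downsample(residual, s)
        maps.append(coarse)
        # Whatever this scale cannot express is left for finer scales.
        residual = residual - upsample(coarse, T)
    return maps

def decode_multiscale(maps, T):
    """Return the running reconstruction after adding each scale in turn."""
    recon = np.zeros((T, maps[0].shape[1]))
    partial = []
    for m in maps:
        recon = recon + upsample(m, T)
        partial.append(recon.copy())
    return partial

# Usage: a 16-step, 7-dimensional action chunk (sizes are illustrative).
rng = np.random.default_rng(0)
actions = rng.normal(size=(16, 7))
maps = encode_multiscale(actions, scales=[1, 2, 4, 8, 16])
partial = decode_multiscale(maps, T=16)
for m, p in zip(maps, partial):
    print(m.shape[0], float(np.abs(p - actions).mean()))
```

In this toy setup, a next-scale autoregressive model would predict each entry of `maps` conditioned on the coarser ones, so generation would take one forward pass per scale rather than one per action token or per denoising step. The sketch only shows the decomposition and reconstruction that such a predictor would operate on.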

