CV, LG, RO | Jul 17, 2025

AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation

arXiv:2507.12768v1 | 21 citations
Originality: Incremental advance
AI Analysis

This work addresses the scalability and efficiency of data collection for researchers and practitioners in robotics and embodied AI. It is an incremental advance, since it builds on existing VLA models.

The paper tackles high data acquisition costs and limited generalization in vision-language-action models for bimanual manipulation. It introduces a task-agnostic action paradigm and a self-supervised data collection framework. The reported results are a more than 30x speedup in data collection over human teleoperation, a 51% improvement in test accuracy, and 30-40% higher success rates in downstream tasks.

Vision-language-action (VLA) models have shown promise on task-conditioned control in complex settings such as bimanual manipulation. However, the heavy reliance on task-specific human demonstrations limits their generalization and incurs high data acquisition costs. In this work, we present a new notion of task-agnostic action paradigm that decouples action execution from task-specific conditioning, enhancing scalability, efficiency, and cost-effectiveness. To address the data collection challenges posed by this paradigm (such as low coverage density, behavioral redundancy, and safety risks), we introduce ATARA (Automated Task-Agnostic Random Actions), a scalable self-supervised framework that accelerates collection by over $30\times$ compared to human teleoperation. To further enable effective learning from task-agnostic data, which often suffers from distribution mismatch and irrelevant trajectories, we propose AnyPos, an inverse dynamics model equipped with Arm-Decoupled Estimation and a Direction-Aware Decoder (DAD). We additionally integrate a video-conditioned action validation module to verify the feasibility of learned policies across diverse manipulation tasks. Extensive experiments show that the AnyPos-ATARA pipeline yields a 51% improvement in test accuracy and achieves 30-40% higher success rates in downstream tasks such as lifting, pick-and-place, and clicking, using replay-based video validation. Project Page: https://embodiedfoundation.github.io/vidar_anypos
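As a rough illustration of what "task-agnostic" data means here, the toy sketch below samples random joint targets for two arms and stores (observation, action) pairs with no task label, which is the kind of data an inverse dynamics model could be trained on. This is not ATARA or AnyPos: the joint limits, the `Sample` record, the `fake_observe` placeholder, and the uniform sampling are all assumptions made for illustration. Naive uniform sampling like this is exactly what runs into the coverage, redundancy, and safety problems the abstract mentions.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical joint limits (radians) for one 6-DoF arm; not taken from the paper.
JOINT_LIMITS: List[Tuple[float, float]] = [(-1.0, 1.0)] * 6


@dataclass
class Sample:
    observation: List[float]   # stand-in for a camera image
    left_action: List[float]   # joint targets for the left arm
    right_action: List[float]  # joint targets for the right arm


def sample_arm_targets(rng: random.Random) -> List[float]:
    """Draw one uniformly random joint configuration inside the limits."""
    return [rng.uniform(lo, hi) for lo, hi in JOINT_LIMITS]


def fake_observe(left: List[float], right: List[float]) -> List[float]:
    """Placeholder for executing the targets and capturing an image."""
    return left + right  # a real system would return pixels here


def collect(num_samples: int, seed: int = 0) -> List[Sample]:
    """Collect (observation, action) pairs that carry no task label."""
    rng = random.Random(seed)
    data = []
    for _ in range(num_samples):
        left = sample_arm_targets(rng)    # each arm is sampled independently
        right = sample_arm_targets(rng)
        data.append(Sample(fake_observe(left, right), left, right))
    return data


dataset = collect(100)
```

Keeping the left and right targets as separate fields loosely mirrors the idea of estimating each arm on its own, but the sketch makes no claim about how Arm-Decoupled Estimation is actually implemented.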

