RO | CV | May 21, 2025

Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization

arXiv:2505.15660v3 | 25 citations | h-index: 10
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of achieving general-purpose robotic manipulation in open-world settings. Its contribution is incremental because it builds on existing VLA and LLM methods.

The paper tackles cross-task generalization in vision-language-action (VLA) models for robotic manipulation. It introduces AGNOSTOS, a benchmark with 23 unseen tasks, and finds that current models struggle on it. It then proposes X-ICM, a method that uses in-context demonstrations from seen tasks and significantly improves performance over leading models.

The generalization capabilities of vision-language-action (VLA) models to unseen tasks are crucial to achieving general-purpose robotic manipulation in open-world settings. However, the cross-task generalization capabilities of existing VLA models remain significantly underexplored. To address this gap, we introduce AGNOSTOS, a novel simulation benchmark designed to rigorously evaluate cross-task zero-shot generalization in manipulation. AGNOSTOS comprises 23 unseen manipulation tasks for testing, distinct from common training task distributions, and incorporates two levels of generalization difficulty to assess robustness. Our systematic evaluation reveals that current VLA models, despite being trained on diverse datasets, struggle to generalize effectively to these unseen tasks. To overcome this limitation, we propose Cross-Task In-Context Manipulation (X-ICM), a method that conditions large language models (LLMs) on in-context demonstrations from seen tasks to predict action sequences for unseen tasks. Additionally, we introduce a dynamics-guided sample selection strategy that identifies relevant demonstrations by capturing cross-task dynamics. On AGNOSTOS, X-ICM significantly improves cross-task zero-shot generalization performance over leading VLAs. We believe AGNOSTOS and X-ICM will serve as valuable tools for advancing general-purpose robotic manipulation.
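The demonstration-selection and prompting idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the feature vectors stand in for whatever dynamics representation X-ICM actually learns, cosine similarity is an assumed ranking criterion, and the names `select_demos`, `build_prompt`, and the toy demonstrations are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_demos(query_feat, demos, k=2):
    """Return the k seen-task demos whose features are closest to the query.

    Assumption: 'feat' is a placeholder for a dynamics-aware embedding.
    """
    ranked = sorted(demos, key=lambda d: cosine(query_feat, d["feat"]), reverse=True)
    return ranked[:k]

def build_prompt(instruction, selected):
    """Format the selected demos as in-context examples, then the unseen task."""
    lines = []
    for d in selected:
        lines.append(f"Task: {d['instruction']}")
        lines.append(f"Actions: {d['actions']}")
    lines.append(f"Task: {instruction}")
    lines.append("Actions:")  # the LLM would be asked to complete this line
    return "\n".join(lines)

# Toy seen-task demonstrations (hypothetical features and action waypoints).
demos = [
    {"instruction": "close the drawer", "feat": [1.0, 0.0, 0.0], "actions": [(0.1, 0.2, 0.3)]},
    {"instruction": "open the jar", "feat": [0.0, 1.0, 0.0], "actions": [(0.4, 0.1, 0.5)]},
    {"instruction": "push the button", "feat": [0.9, 0.1, 0.0], "actions": [(0.2, 0.2, 0.1)]},
]

selected = select_demos([1.0, 0.05, 0.0], demos, k=2)
prompt = build_prompt("close the microwave", selected)
```

Here `prompt` would be passed to an LLM, which is expected to continue the final `Actions:` line with an action sequence for the unseen task. How the real method encodes observations and actions as text is not specified in the abstract.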

Code Implementations: 1 repo
