AI RO

Oct 7, 2025

MetaVLA: Unified Meta Co-training For Efficient Embodied Adaption

Chen Li, Zhantao Yang, Han Zhang, Fangyi Chen, Chenchen Zhu, Anudeepsekhar Bolimera, Marios Savvides

arXiv:2510.05580v1

9.63 citations

h-index: 19

Originality: Incremental advance

AI Analysis

This work addresses the challenge of efficient and scalable adaptation for embodied AI agents, offering a method to reduce resource demands while enhancing generalization, though it is incremental as it builds on existing VLA and meta-learning approaches.

The paper tackles two weaknesses of Vision-Language-Action models: their need for task-specific fine-tuning and their poor generalization to unseen tasks. It proposes MetaVLA, a unified post-training framework that improves both efficiency and performance. On the LIBERO benchmark, MetaVLA outperforms OpenVLA by up to 8.0% on long-horizon tasks, reduces training steps from 240K to 75K, and cuts GPU time by ~76%.

Vision-Language-Action (VLA) models show promise in embodied reasoning, yet remain far from true generalists: they often require task-specific fine-tuning and generalize poorly to unseen tasks. We propose MetaVLA, a unified, backbone-agnostic post-training framework for efficient and scalable alignment. MetaVLA introduces Context-Aware Meta Co-Training, which consolidates diverse target tasks into a single fine-tuning stage while leveraging structurally diverse auxiliary tasks to improve in-domain generalization. Unlike naive multi-task SFT, MetaVLA integrates a lightweight meta-learning mechanism, derived from Attentive Neural Processes, to enable rapid adaptation from diverse contexts with minimal architectural change or inference overhead. On the LIBERO benchmark, MetaVLA with six auxiliary tasks outperforms OpenVLA by up to 8.0% on long-horizon tasks, reduces training steps from 240K to 75K, and cuts GPU time by ~76%. These results show that scalable, low-resource post-training is achievable, paving the way toward general-purpose embodied agents. Code will be available.
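The abstract does not spell out the meta-learning mechanism, so the following is only a rough sketch of the general idea behind Attentive Neural Processes: each target query attends over a set of context examples and receives a weighted summary of them. It is not MetaVLA's implementation. The function name `attentive_aggregate`, the feature dimensions, and the random data are all hypothetical, chosen purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_aggregate(target_queries, context_keys, context_values):
    """Scaled dot-product cross-attention from targets to context.

    Hypothetical helper: illustrates ANP-style context aggregation,
    not the mechanism used in the paper.
    """
    d = target_queries.shape[-1]
    scores = target_queries @ context_keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ context_values, weights

# Toy data (assumed sizes): 6 context examples, 4 target queries.
rng = np.random.default_rng(0)
context_keys = rng.normal(size=(6, 16))
context_values = rng.normal(size=(6, 32))
target_queries = rng.normal(size=(4, 16))

summary, weights = attentive_aggregate(target_queries, context_keys, context_values)
print(summary.shape, weights.shape)
```

In this sketch, each target receives its own context summary, in contrast to averaging all context examples into a single shared vector; that per-target weighting is what the "attentive" part of the name refers to.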
