CV LG RO Nov 30, 2025

MM-ACT: Learn from Multimodal Parallel Generation to Act

Haotian Liang, Xinyi Chen, Bin Wang, Mingkang Chen, Yitian Liu, Yuhao Zhang, Zanxin Chen, Tianshuo Yang, Yilun Chen, Jiangmiao Pang, Dong Liu, Xiaokang Yang

arXiv:2512.00975v1 15.56 citations h-index: 12 Has Code

Originality: Highly original

AI Analysis

This work addresses the problem of enabling robots to perform diverse tasks through multimodal understanding and generation. It presents a novel method rather than an incremental improvement.

The paper tackles the challenge of developing a generalist robotic policy by introducing MM-ACT, a unified Vision-Language-Action model that integrates text, image, and action in a shared token space. It reports success rates of 96.3% on the LIBERO simulation, 72.0% across three real Franka tasks, and 52.38% across eight RoboTwin2.0 bimanual tasks, with an additional 9.25% gain from cross-modal learning.

A generalist robotic policy needs both semantic understanding for task planning and the ability to interact with the environment through predictive capabilities. To tackle this, we present MM-ACT, a unified Vision-Language-Action (VLA) model that integrates text, image, and action in a shared token space and performs generation across all three modalities. MM-ACT adopts a re-mask parallel decoding strategy for text and image generation, and employs a one-step parallel decoding strategy for action generation to improve efficiency. We introduce Context-Shared Multimodal Learning, a unified training paradigm that supervises generation in all three modalities from a shared context, enhancing action generation through cross-modal learning. We conduct experiments on the LIBERO simulation and Franka real-robot setups to assess in-domain performance, and on RoboTwin2.0 to assess out-of-domain performance. Our approach achieves a success rate of 96.3% on LIBERO, 72.0% across three tasks on a real Franka robot, and 52.38% across eight bimanual tasks of RoboTwin2.0, with an additional gain of 9.25% from cross-modal learning. We release our code, models, and data at https://github.com/HHYHRHY/MM-ACT.
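The contrast between the two decoding strategies named in the abstract can be illustrated with a minimal sketch. It assumes a generic confidence-based scheme of the kind used in masked-token generators: start from a fully masked sequence, predict every position in parallel, commit the most confident predictions, and leave the rest masked for the next pass. The abstract does not specify MM-ACT's actual schedule or model interface, so `MASK`, `toy_predict`, the linear unmasking schedule, and both function names are hypothetical stand-ins, not the authors' implementation.

```python
import random

MASK = -1  # hypothetical id for a masked position


def toy_predict(tokens, rng):
    """Stand-in for one model forward pass (assumption, not MM-ACT's model).

    Returns a (token_id, confidence) pair for every position in parallel.
    """
    return [(rng.randrange(100), rng.random()) for _ in tokens]


def remask_parallel_decode(length, steps, predict, rng):
    """Iterative parallel decoding: commit confident tokens, keep the rest masked."""
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = predict(tokens, rng)  # one parallel pass over all positions
        # Linear schedule (assumed): after `step` passes, about
        # length * step / steps positions should be committed.
        target_filled = (length * step) // steps
        already_filled = length - len(masked)
        n_commit = max(1, target_filled - already_filled)
        # Commit the highest-confidence masked positions; the others stay
        # masked and are predicted again in the next pass.
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


def one_step_parallel_decode(length, predict, rng):
    """Single-pass parallel decoding: accept every prediction at once."""
    preds = predict([MASK] * length, rng)
    return [token for token, _ in preds]


if __name__ == "__main__":
    rng = random.Random(0)
    print(remask_parallel_decode(16, 4, toy_predict, rng))
    print(one_step_parallel_decode(8, toy_predict, rng))
```

In this sketch the iterative variant costs up to `steps` forward passes while the one-step variant costs exactly one, which matches the efficiency motivation the abstract gives for using one-step decoding on actions.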

