RO · Apr 30

MotuBrain: An Advanced World Action Model for Robot Control

arXiv:2604.27792 · 97.31 citations
Predicted impact top 4% in RO · last 90 days
Originality Highly original
AI Analysis

For robotics researchers, MotuBrain provides a versatile world action model that improves real-time applicability and scalability across diverse robot embodiments.

MotuBrain introduces a unified multimodal generative model that jointly models video and action for robot control, achieving over 50x speedup for real-time deployment while supporting multiple inference modes across heterogeneous data.

Vision-Language-Action (VLA) models achieve strong semantic generalization but often lack fine-grained modeling of world dynamics. Recent work explores video generation models as a foundation for world modeling, leading to unified World Action Models (WAMs) that jointly model visual dynamics and actions. We present MotuBrain, a unified multimodal generative model that jointly models video and action under a UniDiffuser formulation with a three-stream Mixture-of-Transformers architecture. A single model supports multiple inference modes, including policy learning, world modeling, video generation, inverse dynamics, and joint video-action prediction, while scaling to heterogeneous multimodal data such as video-only and cross-embodiment robot data. To improve real-world applicability, MotuBrain introduces a unified multiview representation, explicit language-action coupling, and an efficient inference stack, achieving over 50x speedup for real-time deployment.
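
The abstract says one model covers five inference modes under a UniDiffuser formulation. In the general UniDiffuser recipe, each modality gets its own diffusion timestep: a modality held at timestep 0 is treated as clean conditioning, a modality held at the maximum timestep is pure noise and is effectively marginalised out, and a modality being generated follows the running timestep. The sketch below illustrates that recipe only. The mode names, the `modality_timesteps` helper, `T_MAX`, and the exact mapping of modes to timestep pairs are assumptions made for illustration and are not taken from the MotuBrain paper.

```python
from enum import Enum

# Hypothetical number of diffusion steps; the abstract does not state one.
T_MAX = 1000


class Mode(Enum):
    POLICY = "policy"                      # actions, future video marginalised
    WORLD_MODEL = "world_model"            # video given known actions
    VIDEO_GENERATION = "video_generation"  # video, actions marginalised
    INVERSE_DYNAMICS = "inverse_dynamics"  # actions given known video
    JOINT = "joint"                        # video and actions together


def modality_timesteps(mode, t, t_max=T_MAX):
    """Return (t_video, t_action) for the current denoising step t.

    Assumed convention (generic UniDiffuser, not confirmed for MotuBrain):
      0      -> modality is clean and acts as conditioning
      t_max  -> modality is pure noise, i.e. marginalised out
      t      -> modality is currently being denoised
    """
    if not 0 <= t <= t_max:
        raise ValueError(f"t must be in [0, {t_max}], got {t}")
    if mode is Mode.POLICY:
        return (t_max, t)
    if mode is Mode.WORLD_MODEL:
        return (t, 0)
    if mode is Mode.VIDEO_GENERATION:
        return (t, t_max)
    if mode is Mode.INVERSE_DYNAMICS:
        return (0, t)
    if mode is Mode.JOINT:
        return (t, t)
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    # Show the timestep pair each mode would feed to a shared denoiser.
    for mode in Mode:
        print(mode.value, modality_timesteps(mode, t=500))
```

Under this convention the network weights never change between modes; only the pair of timesteps passed alongside the video and action tokens does, which is one plausible reading of how a single model can serve as policy, world model, and inverse dynamics model at once.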

