RO | CV | Oct 22, 2025

GigaBrain-0: A World Model-Powered Vision-Language-Action Model

arXiv:2510.19430v1 | 30 citations | h-index: 12
Originality: Incremental advance
AI Analysis

This paper addresses scalability and generalization limitations in generalist robot training, though it appears incremental because it builds on existing world model and VLA concepts.

The paper tackles the cost of real-world robot data collection for Vision-Language-Action models by introducing GigaBrain-0, which uses world model-generated data to reduce reliance on real data and improve cross-task generalization. It reports substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks.

Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearances (e.g., textures, colors), object placements, and camera viewpoints. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.

