Humanoid World Models: Open World Foundation Models for Humanoid Robotics
This work addresses humanoid robot reasoning and planning in open-world environments and is aimed at robotics researchers and practitioners. Its contribution is incremental: it builds on existing world-model concepts and adds architectural optimizations.
The authors tackle the challenge of enabling humanoid robots to operate in complex open-world settings by developing Humanoid World Models (HWM), a family of lightweight, open-source models that forecast future egocentric video conditioned on humanoid control tokens. Their parameter-sharing techniques reduce model size by 33-53% with minimal impact on performance or visual fidelity.
Humanoid robots, with their human-like form, are uniquely suited for interacting in environments built for people. However, enabling humanoids to reason, plan, and act in complex open-world settings remains a challenge. World models, which predict the future outcome of a given action, can support these capabilities by serving as a dynamics model in long-horizon planning and by generating synthetic data for policy learning. We introduce Humanoid World Models (HWM), a family of lightweight, open-source models that forecast future egocentric video conditioned on humanoid control tokens. We train two types of generative models, Masked Transformers and Flow-Matching models, on 100 hours of humanoid demonstrations. Additionally, we explore architectural variants with different attention mechanisms and parameter-sharing strategies. Our parameter-sharing techniques reduce model size by 33-53% with minimal impact on performance or visual fidelity. HWMs are designed to be trained and deployed in practical academic and small-lab settings, such as on 1-2 GPUs.
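To make the parameter-sharing claim concrete, the following is a minimal sketch of the arithmetic behind weight sharing in a transformer block that processes two token streams (for instance, video tokens and control tokens). It is not the HWM architecture: the two-stream layout, the function names, the layer shapes, and `d_model = 512` are all assumptions chosen for illustration, and biases and normalization layers are ignored. The percentages it prints follow from these assumed shapes and are not the paper's measurements.

```python
# Toy parameter-count arithmetic for a two-stream transformer block.
# Every name and size here is hypothetical, not taken from the HWM code.

def attention_params(d_model: int) -> int:
    """Q, K, V, and output projections, ignoring biases."""
    return 4 * d_model * d_model

def mlp_params(d_model: int, expansion: int = 4) -> int:
    """Two linear layers with hidden width expansion * d_model, ignoring biases."""
    return 2 * expansion * d_model * d_model

def block_params(d_model: int, n_streams: int,
                 share_attention: bool, share_mlp: bool) -> int:
    """Parameters in one block; a shared sub-layer is stored once for all streams."""
    attn_copies = 1 if share_attention else n_streams
    mlp_copies = 1 if share_mlp else n_streams
    return (attn_copies * attention_params(d_model)
            + mlp_copies * mlp_params(d_model))

def reduction(baseline: int, variant: int) -> float:
    """Fraction of parameters saved relative to the baseline."""
    return 1.0 - variant / baseline

d_model = 512    # assumed width
n_streams = 2    # assumed: one stream for video tokens, one for control tokens

separate = block_params(d_model, n_streams, share_attention=False, share_mlp=False)
shared_mlp = block_params(d_model, n_streams, share_attention=False, share_mlp=True)
fully_shared = block_params(d_model, n_streams, share_attention=True, share_mlp=True)

print(f"separate weights: {separate:,} parameters")
print(f"shared MLP:       {shared_mlp:,} parameters "
      f"({reduction(separate, shared_mlp):.1%} smaller)")
print(f"fully shared:     {fully_shared:,} parameters "
      f"({reduction(separate, fully_shared):.1%} smaller)")
```

In this toy setting, the saving depends only on which sub-layers are stored once rather than per stream, which is why sharing can shrink a model without changing its depth or width. Whether the accuracy cost is small, as the abstract reports for HWM, is an empirical question that this arithmetic cannot answer.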