LGAIRO | Oct 8, 2025

Generative World Modelling for Humanoids: 1X World Model Challenge Technical Report

arXiv:2510.07092v1 | 4 citations | h-index: 47 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses world modelling for humanoid robotics with benchmark-specific solutions. It is an incremental advance, since it adapts existing methods to a new challenge.

The authors tackled the 1X World Model Challenge on two tracks. For sampling, they adapted a video generation foundation model to future frame prediction and reached 23.0 dB PSNR. For compression, they trained a transformer model for latent code prediction and reached a Top-500 CE of 6.6386. They won both tracks.

World models are a powerful paradigm in AI and robotics, enabling agents to reason about the future by predicting visual observations or compact latent states. The 1X World Model Challenge introduces an open-source benchmark of real-world humanoid interaction, with two complementary tracks: sampling, focused on forecasting future image frames, and compression, focused on predicting future discrete latent codes. For the sampling track, we adapt the video generation foundation model Wan-2.2 TI2V-5B to video-state-conditioned future frame prediction. We condition the video generation on robot states using AdaLN-Zero, and further post-train the model using LoRA. For the compression track, we train a Spatio-Temporal Transformer model from scratch. Our models achieve 23.0 dB PSNR in the sampling task and a Top-500 CE of 6.6386 in the compression task, securing 1st place in both challenges.
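The sampling-track result is reported as PSNR in dB. As a reminder of what that number measures, here is a minimal sketch of the standard PSNR definition, assuming 8-bit pixel values with a peak of 255. The function name `psnr`, the flat pixel lists, and the toy values are illustrative assumptions; the challenge's actual evaluation protocol (resolution, colour handling, averaging over frames) is not described in this text.

```python
import math


def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences.

    Assumes flat sequences of pixel intensities in [0, max_val].
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and have the same length")
    # Mean squared error over all pixels.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        # Identical inputs: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


# Hypothetical toy "frames" (four pixels each), not challenge data.
target = [0, 64, 128, 255]
pred = [10, 54, 138, 245]
print(f"{psnr(pred, target):.2f} dB")
```

Higher is better: every 10 dB gain corresponds to a tenfold reduction in mean squared error against the ground-truth frame.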
