CV · Jan 15

Inference-time Physics Alignment of Video Generative Models with Latent World Models

arXiv:2601.10553v2 · 13 citations · h-index: 27
Originality: Incremental advance
AI Analysis

The paper addresses physically implausible output in video generation, a problem for applications that require realistic content. The advance is incremental: it builds on existing models and contributes a new inference strategy.

Video generative models often violate basic physics principles. The paper introduces an inference-time alignment method that uses a latent world model as a reward to steer denoising trajectories. The method achieves a final score of 62.64% and first place in the ICCV 2025 Perception Test PhysicsIQ Challenge, outperforming the previous state of the art by 7.42%.

State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving the physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search and steer multiple candidate denoising trajectories, which allows test-time compute to be scaled for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, as validated by a human preference study. Notably, in the ICCV 2025 Perception Test PhysicsIQ Challenge, we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve the physics plausibility of video generation, beyond this specific instantiation or parameterization.
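
The "search" half of the idea in the abstract can be illustrated with a best-of-N selection loop: generate several candidates, score each with a reward, and keep the highest-scoring one. The sketch below is a toy illustration only, not the paper's WMReward implementation. The names `toy_generate`, `toy_world_model_reward`, and `best_of_n` are hypothetical. The "video" is a plain list of numbers, and the reward is a hand-written stand-in (negative next-frame prediction error under linear extrapolation) rather than a real latent world model. Gradient-based steering of the denoising trajectory is not shown.

```python
import random
from typing import Callable, Iterable, List, Sequence, Tuple

Video = List[float]  # toy stand-in: one scalar per frame instead of pixels or latents


def toy_generate(seed: int, num_frames: int = 8) -> Video:
    """Hypothetical generator: smooth linear motion plus seed-dependent noise."""
    rng = random.Random(seed)
    jitter = rng.uniform(0.0, 1.0)  # some seeds give noisier samples than others
    return [0.5 * t + rng.gauss(0.0, jitter) for t in range(num_frames)]


def toy_world_model_reward(video: Sequence[float]) -> float:
    """Assumed reward: negative mean squared error of a next-frame predictor.

    The predictor extrapolates linearly from the two previous frames.
    Higher (closer to zero) means the sequence is easier to predict.
    """
    errors = [
        (video[t + 1] - (2 * video[t] - video[t - 1])) ** 2
        for t in range(1, len(video) - 1)
    ]
    return -sum(errors) / len(errors)


def best_of_n(
    generate: Callable[[int], Video],
    reward: Callable[[Sequence[float]], float],
    seeds: Iterable[int],
) -> Tuple[Video, float]:
    """Generate one candidate per seed and return the highest-reward one."""
    scored = [(reward(video), video) for video in map(generate, seeds)]
    best_reward, best_video = max(scored, key=lambda pair: pair[0])
    return best_video, best_reward


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        _, r = best_of_n(toy_generate, toy_world_model_reward, range(n))
        print(f"candidates={n} best_reward={r:.4f}")
```

Because the candidate sets here are nested (seeds 0..N-1), the best reward cannot decrease as N grows. That is the simplest sense in which extra test-time compute can buy a better sample under a fixed reward.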

