CV · GR · Oct 22, 2025

Improving the Physics of Video Generation with VJEPA-2 Reward Signal

arXiv:2510.21840v1 · 2 citations · h-index: 27
Originality: Incremental advance
AI Analysis

The paper addresses physically implausible video generation, a concern for AI and computer vision applications, but the advance is incremental because it builds on existing models (MAGI-1 and VJEPA-2).

The paper tackles the limited physical understanding of state-of-the-art video generative models by using VJEPA-2 as a reward signal to guide generation, improving physics plausibility by approximately 6%.

This is a short technical report describing the winning entry of the PhysicsIQ Challenge, presented at the Perception Test Workshop at ICCV 2025. State-of-the-art video generative models exhibit severely limited physical understanding and often produce implausible videos. The Physics IQ benchmark has shown that visual realism does not imply physics understanding. Yet intuitive physics understanding has been shown to emerge from SSL pretraining on natural videos. In this report, we investigate whether we can leverage SSL-based video world models to improve the physics plausibility of video generative models. In particular, we build on top of the state-of-the-art video generative model MAGI-1 and couple it with the recently introduced Video Joint Embedding Predictive Architecture 2 (VJEPA-2) to guide the generation process. We show that by leveraging VJEPA-2 as a reward signal, we can improve the physics plausibility of state-of-the-art video generative models by ~6%.
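The abstract does not spell out how the reward signal enters generation. As a rough illustration only, one common way to use a world model as a reward is best-of-N selection: sample several candidate videos, score each by how well a predictor anticipates its next-frame embeddings, and keep the highest-scoring one. The sketch below assumes that scheme; it is not taken from the report. All names in it (`surprise_reward`, `best_of_n`, `toy_encode`, `toy_predict_next`) are hypothetical, and the toy encoder and predictor merely stand in for VJEPA-2, while the candidate pool stands in for samples from a generator such as MAGI-1.

```python
from typing import Callable, List, Sequence, Tuple

Video = List[List[float]]  # toy "video": a list of frames, each a feature vector


def surprise_reward(
    video: Video,
    encode: Callable[[Sequence[float]], List[float]],
    predict_next: Callable[[List[float]], List[float]],
) -> float:
    """Negative mean squared error between predicted and actual next-frame
    embeddings. Higher (closer to zero) means the video is less surprising
    to the world model. This scoring rule is an assumption of this sketch."""
    embeddings = [encode(frame) for frame in video]
    total, count = 0.0, 0
    for current, actual_next in zip(embeddings, embeddings[1:]):
        predicted = predict_next(current)
        total += sum((p - a) ** 2 for p, a in zip(predicted, actual_next))
        count += len(actual_next)
    return -total / count if count else 0.0


def best_of_n(
    generate: Callable[[], Video],
    reward: Callable[[Video], float],
    n: int,
) -> Tuple[Video, float]:
    """Draw n candidates from the generator and return the best-scoring one."""
    candidates = [generate() for _ in range(n)]
    scores = [reward(video) for video in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]


# Hypothetical stand-ins: each frame is [position, velocity]; the "world
# model" expects constant-velocity motion.
def toy_encode(frame: Sequence[float]) -> List[float]:
    return list(frame)


def toy_predict_next(embedding: List[float]) -> List[float]:
    position, velocity = embedding
    return [position + velocity, velocity]


def toy_reward(video: Video) -> float:
    return surprise_reward(video, toy_encode, toy_predict_next)


smooth = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]  # follows constant velocity
jumpy = [[0.0, 1.0], [5.0, 1.0], [2.0, 1.0]]   # teleports between frames

pool = iter([jumpy, smooth])  # stands in for two generator samples
chosen, chosen_score = best_of_n(lambda: next(pool), toy_reward, n=2)
```

In this toy setup the constant-velocity clip is preferred over the one that teleports, because its next-frame embeddings match the predictor exactly. A real pipeline would replace the toy functions with a pretrained encoder and predictor, and could apply the reward during sampling instead of only after it.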

