CV · AI · Mar 30, 2025

VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior

arXiv:2503.23368v3 · 34 citations · h-index: 11
Originality: Incremental advance
AI Analysis

This work addresses unrealistic dynamics in AI-generated videos, which matters for applications such as simulation and content creation. It is rated incremental because it builds on existing video diffusion models.

The paper tackles the tendency of video diffusion models to produce physically implausible videos because they lack an understanding of physics. It proposes a two-stage framework that incorporates a vision- and language-informed physical prior, and reports notable superiority over existing methods in generating physically plausible motion.

Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos and drawing the community's attention to their potential as world simulators. However, despite their capabilities, VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics, resulting in incorrect dynamics and event sequences. To address this limitation, we propose a novel two-stage image-to-video generation framework that explicitly incorporates physics through a vision- and language-informed physical prior. In the first stage, we employ a Vision Language Model (VLM) as a coarse-grained motion planner, integrating chain-of-thought and physics-aware reasoning to predict rough motion trajectories/changes that approximate real-world physical dynamics while ensuring inter-frame consistency. In the second stage, we use the predicted motion trajectories/changes to guide the video generation of a VDM. Because the predicted motion trajectories/changes are rough, noise is added during inference to give the VDM freedom to generate motion with finer details. Extensive experimental results demonstrate that our framework can produce physically plausible motion, and comparative evaluations highlight the notable superiority of our approach over existing methods. More video results are available on our Project Page: https://madaoer.github.io/projects/physically_plausible_video_generation.
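
The two-stage idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. The function names `plan_coarse_trajectory` and `perturb_trajectory` are hypothetical, a closed-form ballistic path stands in for the VLM planner, and adding Gaussian noise directly to trajectory coordinates is an assumption, since the abstract does not say where or how the noise is injected.

```python
import random


def plan_coarse_trajectory(start, velocity, num_frames, dt=0.125, gravity=9.8):
    """Hypothetical stand-in for the stage-one planner.

    The paper uses a VLM with physics-aware reasoning here; this sketch
    substitutes a simple ballistic formula so the example is runnable.
    Returns one (x, y) point per frame.
    """
    x0, y0 = start
    vx, vy = velocity
    points = []
    for i in range(num_frames):
        t = i * dt
        points.append((x0 + vx * t, y0 + vy * t - 0.5 * gravity * t * t))
    return points


def perturb_trajectory(points, sigma, seed=None):
    """Add Gaussian noise to each point (an assumed noise model).

    The intent mirrors the abstract: the coarse plan is only a rough guide,
    so the downstream generator is given some freedom around it.
    """
    rng = random.Random(seed)
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)) for x, y in points]


# Stage 1: coarse plan. Stage 2 would consume `guidance` as a motion condition.
coarse = plan_coarse_trajectory(start=(0.0, 0.0), velocity=(1.0, 2.0), num_frames=16)
guidance = perturb_trajectory(coarse, sigma=0.05, seed=0)
```

In this sketch, `sigma` controls how closely the guidance follows the coarse plan: with `sigma=0.0` the guidance equals the plan, and larger values leave more room for the generator to deviate from it.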

