CV · Nov 25, 2025

Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement

Yang Liu, Xilin Zhao, Peisong Wen, Siran Dai, Qingming Huang

arXiv:2511.20280v1 · 1 citation

Originality: Incremental advance

AI Analysis

This work addresses physics-consistent video generation for AI and multimedia applications. It is an incremental advance, delivered as a training-free, plug-and-play method.

The paper tackles the problem that video generation models produce results that do not align with real-world physical principles. It proposes an iterative self-refinement framework in which large language models and vision-language models provide physics-aware guidance. On the PhyIQ benchmark, the method raises the Physics-IQ score from 56.31 to 62.38, a gain of 6.07 points.

Recent progress in video generation has led to impressive visual quality, yet current models still struggle to produce results that align with real-world physical principles. To this end, we propose an iterative self-refinement framework that leverages large language models and vision-language models to provide physics-aware guidance for video generation. Specifically, we introduce a multimodal chain-of-thought (MM-CoT) process that refines prompts based on feedback from physical inconsistencies, progressively enhancing generation quality. This method is training-free and plug-and-play, making it readily applicable to a wide range of video generation models. Experiments on the PhyIQ benchmark show that our method improves the Physics-IQ score from 56.31 to 62.38. We hope this work serves as a preliminary exploration of physics-consistent video generation and may offer insights for future research.
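
The refinement loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `generate_video`, `critique_video`, and `rewrite_prompt`, the `Critique` type, and the `max_rounds` parameter are all hypothetical, and the paper's actual MM-CoT prompting is not reproduced here. The three callables stand in for a video generator, a vision-language model that flags physical inconsistencies, and a language model that rewrites the prompt.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Critique:
    """Hypothetical container for the physical inconsistencies a VLM reports."""
    issues: List[str]


def refine_loop(
    prompt: str,
    generate_video: Callable[[str], object],
    critique_video: Callable[[object, str], Critique],
    rewrite_prompt: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, object, List[Tuple[str, List[str]]]]:
    """Training-free loop: generate, critique, rewrite the prompt, regenerate.

    No model weights are updated; only the text prompt changes between rounds.
    """
    history: List[Tuple[str, List[str]]] = []
    video = generate_video(prompt)
    for _ in range(max_rounds):
        critique = critique_video(video, prompt)
        history.append((prompt, critique.issues))
        if not critique.issues:
            break  # no inconsistencies reported, so stop refining
        prompt = rewrite_prompt(prompt, critique.issues)
        video = generate_video(prompt)
    return prompt, video, history


# Toy stand-ins (assumptions for illustration only; real models would go here).
def toy_generate(prompt: str) -> str:
    return f"<video:{prompt}>"


def toy_critique(video: str, prompt: str) -> Critique:
    if "gravity" in prompt:
        return Critique(issues=[])
    return Critique(issues=["ball floats instead of falling"])


def toy_rewrite(prompt: str, issues: List[str]) -> str:
    return prompt + " The ball falls under gravity."


final_prompt, final_video, history = refine_loop(
    "A ball rolls off a table.", toy_generate, toy_critique, toy_rewrite
)
```

Because the loop touches only the prompt and treats the generator as a black box, any video model exposed as a `prompt -> video` callable could in principle be dropped in, which is the sense in which such a scheme is plug-and-play.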
