CV | Nov 25, 2025

Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement

arXiv:2511.20280v1 | 1 citation
Originality: Incremental advance
AI Analysis

This work addresses physics-consistent video generation for AI and multimedia applications. It is an incremental advance, delivered as a training-free, plug-and-play method.

The paper tackles video generation models whose outputs violate real-world physical principles. It proposes an iterative self-refinement framework in which large language models and vision-language models provide physics-aware guidance, raising the Physics-IQ score from 56.31 to 62.38 (a gain of 6.07 points) on the PhyIQ benchmark.

Recent progress in video generation has led to impressive visual quality, yet current models still struggle to produce results that align with real-world physical principles. To this end, we propose an iterative self-refinement framework that leverages large language models and vision-language models to provide physics-aware guidance for video generation. Specifically, we introduce a multimodal chain-of-thought (MM-CoT) process that refines prompts based on feedback from physical inconsistencies, progressively enhancing generation quality. This method is training-free and plug-and-play, making it readily applicable to a wide range of video generation models. Experiments on the PhyIQ benchmark show that our method improves the Physics-IQ score from 56.31 to 62.38. We hope this work serves as a preliminary exploration of physics-consistent video generation and may offer insights for future research.
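The refinement loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the scoring scale, the stopping rule, or the number of rounds, so `self_refine`, `Critique`, `max_rounds`, `target`, and the toy stand-ins for the video generator, the VLM critic, and the LLM prompt rewriter are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Critique:
    """Hypothetical feedback record returned by a VLM critic."""
    score: float        # assumed physics-plausibility score in [0, 1]
    issues: List[str]   # textual descriptions of physical inconsistencies


def self_refine(
    prompt: str,
    generate: Callable[[str], Any],             # stand-in for any video generator
    critique: Callable[[str, Any], Critique],   # stand-in for a VLM critic
    rewrite: Callable[[str, List[str]], str],   # stand-in for an LLM prompt rewriter
    max_rounds: int = 3,                        # assumed budget, not from the paper
    target: float = 0.9,                        # assumed stopping threshold
) -> Tuple[Any, float, List[Tuple[str, float]]]:
    """Generate, critique, and rewrite the prompt until the critic is satisfied."""
    best_video, best_score = None, float("-inf")
    history: List[Tuple[str, float]] = []
    for _ in range(max_rounds):
        video = generate(prompt)
        feedback = critique(prompt, video)
        history.append((prompt, feedback.score))
        # Keep the best result seen so far, since a rewrite may not always help.
        if feedback.score > best_score:
            best_video, best_score = video, feedback.score
        if feedback.score >= target or not feedback.issues:
            break
        prompt = rewrite(prompt, feedback.issues)
    return best_video, best_score, history


# Toy stand-ins so the sketch runs without any model; they are not real APIs.
def toy_generate(prompt: str) -> str:
    return f"<video for: {prompt}>"


def toy_critique(prompt: str, video: str) -> Critique:
    missing = [k for k in ("gravity", "friction") if k not in video]
    return Critique(score=1.0 - 0.5 * len(missing),
                    issues=[f"ignores {k}" for k in missing])


def toy_rewrite(prompt: str, issues: List[str]) -> str:
    # Address one reported issue per round by appending a constraint.
    return prompt + "; " + issues[0].replace("ignores", "respect")
```

Because the loop only calls the three functions passed in, no model weights are touched, which is consistent with the training-free, plug-and-play framing in the abstract. Calling `self_refine("a ball rolls off a table", toy_generate, toy_critique, toy_rewrite)` walks through the loop with the toy stand-ins.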

