CV · Mar 12, 2025

PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop

Chenyu Li, Oscar Michel, Xichen Pan, Sainan Liu, Mike Roberts, Saining Xie

arXiv:2503.09595v130.140 citations · h-index: 11 · Has Code · ICML

Originality: Incremental advance

AI Analysis

This work addresses the need for more reliable physics simulation in video generation models. The advance is incremental: it builds on existing pre-trained models and improves their behavior on one specific physical task.

The paper tackles the problem of improving physical accuracy in video diffusion models by focusing on object freefall. It shows that fine-tuning on simulated videos, combined with a novel reward modeling procedure, can induce dropping behavior, while also revealing limitations in generalization and distribution modeling.

Large-scale pre-trained video generation models excel in content creation but are not reliable as physically accurate world simulators out of the box. This work studies the process of post-training these models for accurate world modeling through the lens of the simple, yet fundamental, physics task of modeling object freefall. We show state-of-the-art video generation models struggle with this basic task, despite their visually impressive outputs. To remedy this problem, we find that fine-tuning on a relatively small amount of simulated videos is effective in inducing the dropping behavior in the model, and we can further improve results through a novel reward modeling procedure we introduce. Our study also reveals key limitations of post-training in generalization and distribution modeling. Additionally, we release a benchmark for this task that may serve as a useful diagnostic tool for tracking physical accuracy in large-scale video generative model development.
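As a rough illustration of how "physical accuracy" on a freefall task can be turned into a number, the sketch below computes the ideal height of an object dropped from rest, y(t) = y0 - (1/2) g t^2, sampled once per video frame, and compares it with a trajectory in which the object simply floats. The function names, drop height, frame rate, and the mean-absolute-error score are all assumptions made for illustration; they are not the benchmark's metrics or the authors' code.

```python
import numpy as np

G = 9.81  # gravitational acceleration in m/s^2


def freefall_heights(y0, fps, num_frames, floor=0.0):
    """Height (m) of an object dropped from rest at y0, one sample per frame."""
    t = np.arange(num_frames) / fps      # timestamp of each frame in seconds
    y = y0 - 0.5 * G * t**2              # constant-acceleration kinematics
    return np.maximum(y, floor)          # the object stops at the floor


def mean_trajectory_error(predicted, reference):
    """Mean absolute height difference (m) between two trajectories."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.mean(np.abs(predicted - reference)))


# Hypothetical setup: 1 m drop, 16 frames per second, 8 frames.
reference = freefall_heights(y0=1.0, fps=16, num_frames=8)

# A physically implausible "generated" trajectory: the object never falls.
floating = np.full(8, 1.0)

perfect_error = mean_trajectory_error(reference, reference)
floating_error = mean_trajectory_error(floating, reference)
print(perfect_error, floating_error)
```

A trajectory that matches the kinematic reference scores zero, while the floating object accumulates error that grows with every frame. In practice the predicted heights would have to come from tracking the object in generated video, which this sketch does not attempt.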
