CV · Sep 25, 2025

NewtonGen: Physics-Consistent and Controllable Text-to-Video Generation via Neural Newtonian Dynamics

Yu Yuan, Xijun Wang, Tharindu Wickremasinghe, Zeeshan Nadir, Bole Ma, Stanley H. Chan

arXiv:2509.21309v1 · 27.228 citations · h-index: 5

Originality: Highly original

AI Analysis

This work addresses unrealistic motion and limited control in text-to-video generation, which matter for applications that require accurate physical simulation.

The paper addresses physical inconsistency and limited controllability in text-to-video generation by proposing NewtonGen, a framework that integrates data-driven synthesis with learnable Neural Newtonian Dynamics to produce physically consistent video with precise parameter control.

A primary bottleneck in large-scale text-to-video generation today is physical consistency and controllability. Despite recent advances, state-of-the-art models often produce unrealistic motions, such as objects falling upward, or abrupt changes in velocity and direction. Moreover, these models lack precise parameter control, struggling to generate physically consistent dynamics under different initial conditions. We argue that this fundamental limitation stems from current models learning motion distributions solely from appearance, while lacking an understanding of the underlying dynamics. In this work, we propose NewtonGen, a framework that integrates data-driven synthesis with learnable physical principles. At its core lies trainable Neural Newtonian Dynamics (NND), which can model and predict a variety of Newtonian motions, thereby injecting latent dynamical constraints into the video generation process. By jointly leveraging data priors and dynamical guidance, NewtonGen enables physically consistent video synthesis with precise parameter control.
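
To make "physically consistent dynamics under different initial conditions" concrete, here is a minimal sketch of constant-gravity Newtonian motion. It is not the paper's Neural Newtonian Dynamics: the function names, the fixed gravitational constant, and the semi-implicit Euler integrator are all assumptions chosen for illustration. It only shows the kind of behavior the abstract treats as a baseline: an object released under gravity moves downward, and its velocity changes smoothly between frames rather than abruptly.

```python
def simulate_projectile(x0, y0, vx0, vy0, g=9.81, dt=0.01, steps=100):
    """Integrate constant-gravity motion with semi-implicit Euler.

    Hypothetical helper for illustration; returns a list of
    (x, y, vx, vy) states, one per time step, starting with the
    initial condition.
    """
    x, y, vx, vy = x0, y0, vx0, vy0
    states = [(x, y, vx, vy)]
    for _ in range(steps):
        vy -= g * dt      # gravity only affects vertical velocity
        x += vx * dt      # horizontal velocity stays constant
        y += vy * dt      # position uses the updated velocity
        states.append((x, y, vx, vy))
    return states


def max_velocity_jump(states):
    """Largest frame-to-frame change in vertical velocity (hypothetical check)."""
    return max(abs(b[3] - a[3]) for a, b in zip(states, states[1:]))


# Same dynamics, different initial conditions.
dropped = simulate_projectile(x0=0.0, y0=10.0, vx0=2.0, vy0=0.0)
thrown_up = simulate_projectile(x0=0.0, y0=10.0, vx0=2.0, vy0=5.0)
```

Under these assumptions, changing only the initial velocity changes the trajectory while the per-step velocity change stays at `g * dt`, which is the sort of parameter-controlled, smooth motion the abstract contrasts with "objects falling upward" or abrupt velocity changes.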
