TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility
This work addresses the challenge of quantitatively evaluating physical realism in video generation, a capability that matters for improving multimodal AI models. The contribution is incremental in that it builds on existing Video-Language Models (VLMs).
The paper tackles the problem of assessing physical plausibility in videos with two contributions: TRAVL, a fine-tuning recipe for Video-Language Models, and ImplausiBench, a benchmark of 300 videos (150 real, 150 generated). The result is improved performance as measured against both human judgments and LLM-as-judge metrics.
Despite impressive visual fidelity, modern video generative models frequently produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing in ways that defy causality. While humans can easily detect such implausibilities, there remains no robust method for quantitatively assessing physical realism in video. In this work, we explore whether Video-Language Models (VLMs) can be trained to serve as reliable judges of physical plausibility. We find that existing VLMs struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. To address this, we introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding and discrimination in VLMs. To evaluate physical reasoning more rigorously, we propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding. Performance is reported both with gold-standard human judgments and stricter LLM-as-judge metrics. Together, TRAVL and ImplausiBench offer a unified framework for probing and improving physical plausibility in multimodal models, shedding light on a challenging and underexplored aspect of visual-temporal understanding.
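To make the evaluation setup concrete, here is a minimal sketch of how a judge's agreement with human labels could be scored separately on the real and generated halves of a benchmark like ImplausiBench. The `Judgment` record, its field names, and the `accuracy_by_source` helper are hypothetical and are not taken from the paper; the paper's actual scoring protocol, including its LLM-as-judge metric, may differ.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One judged video (hypothetical record layout, for illustration only)."""
    video_id: str
    is_generated: bool           # True for generated clips, False for real ones
    predicted_implausible: bool  # the model's verdict
    human_implausible: bool      # the human reference label


def accuracy_by_source(judgments):
    """Return the fraction of verdicts matching human labels, split by source."""
    totals = {"real": 0, "generated": 0}
    correct = {"real": 0, "generated": 0}
    for j in judgments:
        key = "generated" if j.is_generated else "real"
        totals[key] += 1
        if j.predicted_implausible == j.human_implausible:
            correct[key] += 1
    # A subset with no videos has no defined accuracy, so report NaN for it.
    return {
        key: (correct[key] / totals[key] if totals[key] else float("nan"))
        for key in totals
    }


# Toy data, not results from the paper.
judgments = [
    Judgment("r1", False, False, False),
    Judgment("r2", False, True, False),
    Judgment("g1", True, True, True),
    Judgment("g2", True, True, True),
]
scores = accuracy_by_source(judgments)
```

Reporting the two subsets separately guards against a degenerate judge: a model that labels every video plausible, or every video implausible, can look strong on one half of a balanced set while failing the other.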