Multi-turn Physics-informed Vision-language Model for Physics-grounded Anomaly Detection
This addresses the limitation of VLMs in detecting dynamic anomalies like irregular rotations, which is important for applications requiring causal understanding of physics.
The paper tackles the problem of physics-grounded anomaly detection in vision-language models by introducing a physics-informed instruction tuning framework that encodes physical priors through multi-turn dialogues. The result is 96.7% AUROC on the Phys-AD benchmark, substantially outperforming prior SOTA (66.9%).
Vision-Language Models (VLMs) demonstrate strong general-purpose reasoning but remain limited in physics-grounded anomaly detection, where causal understanding of dynamics is essential. Existing VLMs, trained predominantly on appearance-centric correlations, fail to capture kinematic constraints, leading to poor performance on anomalies such as irregular rotations or violated mechanical motions. We introduce a physics-informed instruction tuning framework that explicitly encodes object properties, motion paradigms, and dynamic constraints into structured prompts. By delivering these physical priors through multi-turn dialogues, our method decomposes causal reasoning into incremental steps, enabling robust internal representations of normal and abnormal dynamics. Evaluated on the Phys-AD benchmark, our approach achieves 96.7% AUROC in video-level detection--substantially outperforming prior SOTA (66.9%)--and yields superior causal explanations (0.777 LLM score). This work highlights how structured physics priors can transform VLMs into reliable detectors of dynamic anomalies.