BilliardPhys-Bench: Benchmarking Physical Reasoning and Visual Dynamics of Multimodal LLMs

Ben Wang, Xiaogang Li, Ruochen Gao, Peiyao Xiao, Chengliang Xu, Zeyu Wang, Zichao Chen, Bing Zhao, Hu Wei

arXiv:2605.30900 70.7 h-index: 3

AI Analysis

This benchmark identifies critical weaknesses in physical reasoning for current MLLMs, which is important for researchers developing more capable multimodal AI systems.

This paper introduces BilliardPhys-Bench, a new benchmark that evaluates multimodal large language models (MLLMs) on physical reasoning and visual dynamics in synthetic billiards environments. The benchmark assesses collision prediction, wall-bounce reasoning, and final position estimation. It finds that MLLM performance degrades as simulation time and scene complexity increase, and that models exhibit a "stasis bias": they predict no interaction when physical outcomes are difficult to infer.

Current multimodal models handle static image recognition well, but intuitive physical reasoning remains a weakness. Predicting how objects will move and interact from a single image is still difficult for these systems. We present BilliardPhys-Bench, a benchmark for physical reasoning in synthetic billiards environments. Its procedural engine generates randomized scenarios with friction and elastic collisions. The benchmark tests three abilities: (1) predicting ball-to-ball collisions, (2) reasoning about wall bounces, and (3) estimating final ball positions after motion stops. We evaluate recent MLLMs from the GPT, Claude, Gemini, and Qwen families. Performance drops as simulation time increases and scene geometry grows more complex. We also observe a consistent failure mode we call "stasis bias": when the correct physical outcome is harder to infer, models tend to predict no interaction. These findings show where current MLLMs break down on visual dynamics and point toward the need for better physical inductive biases in multimodal architectures.
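To make the three tasks concrete, the following is a minimal sketch of a billiards simulation with friction, wall bounces, and elastic ball-to-ball collisions. It is not the benchmark's engine, which the abstract does not specify. The names `Ball` and `simulate`, the table dimensions, the constant-deceleration friction model, the equal-mass assumption, and the fixed time step are all illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class Ball:
    x: float
    y: float
    vx: float = 0.0
    vy: float = 0.0


def simulate(balls, width=2.0, height=1.0, radius=0.05,
             friction=0.5, dt=0.001, max_time=10.0):
    """Step a hypothetical scene until every ball stops or max_time elapses.

    Returns (collided_pairs, wall_bounces, final_positions), one value per
    benchmark task: collisions, wall bounces, and resting positions.
    All parameters and their defaults are assumptions made for illustration.
    """
    collided = set()
    bounces = [0] * len(balls)
    t = 0.0
    while t < max_time and any(b.vx or b.vy for b in balls):
        for i, b in enumerate(balls):
            speed = math.hypot(b.vx, b.vy)
            if speed == 0.0:
                continue
            # Assumed friction model: constant deceleration along the motion.
            scale = max(0.0, speed - friction * dt) / speed
            b.vx *= scale
            b.vy *= scale
            b.x += b.vx * dt
            b.y += b.vy * dt
            # Wall bounce: mirror the position and flip the normal velocity.
            if b.x < radius and b.vx < 0:
                b.x, b.vx = 2 * radius - b.x, -b.vx
                bounces[i] += 1
            elif b.x > width - radius and b.vx > 0:
                b.x, b.vx = 2 * (width - radius) - b.x, -b.vx
                bounces[i] += 1
            if b.y < radius and b.vy < 0:
                b.y, b.vy = 2 * radius - b.y, -b.vy
                bounces[i] += 1
            elif b.y > height - radius and b.vy > 0:
                b.y, b.vy = 2 * (height - radius) - b.y, -b.vy
                bounces[i] += 1
        # Elastic collision between equal-mass balls: swap the velocity
        # components along the line joining the two centers.
        for i in range(len(balls)):
            for j in range(i + 1, len(balls)):
                a, c = balls[i], balls[j]
                dx, dy = c.x - a.x, c.y - a.y
                dist = math.hypot(dx, dy)
                if 0.0 < dist < 2 * radius:
                    nx, ny = dx / dist, dy / dist
                    rel = (a.vx - c.vx) * nx + (a.vy - c.vy) * ny
                    if rel > 0:  # only resolve pairs that are approaching
                        a.vx -= rel * nx
                        a.vy -= rel * ny
                        c.vx += rel * nx
                        c.vy += rel * ny
                        collided.add((i, j))
        t += dt
    return collided, bounces, [(b.x, b.y) for b in balls]


# Usage: a moving ball strikes a resting ball head-on.
pairs, bounces, final = simulate([Ball(0.5, 0.5, vx=1.0), Ball(1.0, 0.5)])
print(pairs, bounces, final)
```

In a sketch like this, ground-truth labels for all three tasks come from a single rollout, so a model's "no interaction" answer can be checked directly against whether `collided_pairs` and `wall_bounces` are actually empty.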
