AI | May 21, 2025

PhyX: Does Your Model Have the "Wits" for Physical Reasoning?

Hui Shen, Taiqiang Wu, Qi Han, Yunta Hsieh, Jizhou Wang, Yuyue Zhang, Yuxin Cheng, Zijian Hao, Yuansheng Ni, Xin Wang, Zhongwei Wan, Kai Zhang

arXiv:2505.15929v226.628 citations | h-index: 16

Originality: Incremental advance

AI Analysis

This paper addresses the problem of evaluating physical reasoning in AI models, a capability that matters for advancing AI. The contribution is incremental because it builds on existing benchmark methodologies.

The authors tackled the lack of benchmarks for physical reasoning in AI by introducing PhyX, a large-scale multimodal benchmark, and found that state-of-the-art models like GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5% to 45.8% accuracy, with performance gaps exceeding 29% compared to human experts.

Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and understanding of real-world constraints. To address this gap, we introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and wave & acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning. GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy respectively, with performance gaps exceeding 29% compared to human experts. Our analysis exposes critical limitations in current models: over-reliance on memorized disciplinary knowledge, excessive dependence on mathematical formulations, and surface-level visual pattern matching rather than genuine physical understanding. We provide in-depth analysis through fine-grained statistics, detailed case studies, and multiple evaluation paradigms to thoroughly examine physical reasoning capabilities. To ensure reproducibility, we implement a compatible evaluation protocol based on widely used toolkits such as VLMEvalKit, enabling one-click evaluation. More details are available on our project page: https://phyx-bench.github.io/.
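As a rough illustration of the accuracy figures reported above, the following minimal sketch shows how overall and per-domain accuracy might be tallied for a multiple-choice benchmark. It is not the authors' evaluation code: the toy records, the function names `accuracy_by_domain` and `overall_accuracy`, and the exact-match scoring rule are all assumptions made for illustration, and the actual PhyX protocol (built on VLMEvalKit) may score answers differently.

```python
from collections import defaultdict

# Hypothetical toy records, NOT real PhyX data: (domain, predicted, gold).
records = [
    ("mechanics", "B", "B"),
    ("mechanics", "A", "C"),
    ("optics", "D", "D"),
    ("optics", "A", "A"),
    ("thermodynamics", "C", "B"),
]


def accuracy_by_domain(rows):
    """Return {domain: fraction correct}, assuming exact-match scoring."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for domain, predicted, gold in rows:
        total[domain] += 1
        correct[domain] += int(predicted == gold)
    return {domain: correct[domain] / total[domain] for domain in total}


def overall_accuracy(rows):
    """Return the fraction of all rows whose prediction matches the gold answer."""
    return sum(predicted == gold for _, predicted, gold in rows) / len(rows)


per_domain = accuracy_by_domain(records)
overall = overall_accuracy(records)

for domain, acc in sorted(per_domain.items()):
    print(f"{domain}: {acc:.1%}")
print(f"overall: {overall:.1%}")
```

Breaking results down by domain in this way is one simple route to the kind of fine-grained statistics the abstract mentions; a model-versus-human gap would then be the difference between two such accuracy values.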
