ML, LG, May 25

DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking

arXiv:2605.26087 | 99.0 | Has Code
AI Analysis

For AI researchers evaluating LLM reasoning, this benchmark reveals that current models struggle with long-horizon experimental design and hypothesis revision, especially when latent variables are involved.

DiscoverPhysics benchmarks LLMs on discovering physics laws in simulated worlds with deliberately altered physics, finding that the strongest agents pass only half of the 22 worlds and consistently fail on those requiring uncovering latent structure. Open-source models lag behind commercial ones, and good predictive accuracy does not guarantee high explanation quality.

Frontier LLMs now perform strongly across a wide range of physics evaluations, but it is hard to disentangle genuine reasoning from recall of established science. We introduce DiscoverPhysics, an interactive benchmark that asks an LLM agent to discover the laws of motion of a simulated world whose physics deliberately deviates from our own. We construct 22 worlds governed by, among others, screened and fractional-power gravity, multi-species couplings, hidden dark-matter-like particles, non-coordinate-free physics, and time-varying interactions. Each world is generated on demand by an N-body simulator, in which the agent proposes several rounds of experiments, observes raw trajectory data, and ultimately submits both a natural-language explanation of the world's physics and a Python implementation of the inferred law. Because solving a world requires the agent to design informative experiments and revise its hypotheses, the benchmark probes long-horizon reasoning over an experimental history. We evaluate submissions along two complementary axes: trajectory MSE on held-out particles and an LLM-judged explanation score following an expert-written rubric assessing conceptual understanding of each world. Across eleven frontier models, we find that the strongest agents pass only half of the worlds and consistently fail on those where latent structure must be uncovered. Open-source models lag substantially behind commercial models, both in their ability to design informative experiments and in extracting conclusions from the data. We further find that good predictive accuracy does not guarantee high explanation quality and that conceptual understanding depends on hypothesis refinement through well-chosen experiments.
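To make the setup concrete, the following is a minimal sketch of what a single world and its trajectory-MSE check might look like. It is not the benchmark's simulator or scoring code: the function names (`accelerations`, `simulate`, `trajectory_mse`), the softening parameter `eps`, the integrator, and the choice of a force law proportional to 1/r^1.5 are all assumptions made for illustration, and the comparison below uses all particles rather than a held-out subset.

```python
import numpy as np


def accelerations(pos, mass, power=2.0, G=1.0, eps=1e-3):
    """Pairwise attraction with force magnitude G * m_i * m_j / r**power.

    power=2.0 gives Newtonian gravity; other values give a hypothetical
    fractional-power world. eps (an assumption) softens close encounters.
    """
    diff = pos[None, :, :] - pos[:, None, :]        # diff[i, j] = pos[j] - pos[i]
    dist = np.sqrt((diff ** 2).sum(axis=-1) + eps ** 2)
    np.fill_diagonal(dist, np.inf)                  # no self-interaction
    # Acceleration on particle i: sum_j G * m_j * diff[i, j] / r**(power + 1)
    weights = G * mass[None, :] / dist ** (power + 1.0)
    return (weights[:, :, None] * diff).sum(axis=1)


def simulate(pos0, vel0, mass, law, dt=1e-3, steps=500):
    """Velocity-Verlet integration; returns positions of shape (steps + 1, N, dim)."""
    pos, vel = pos0.copy(), vel0.copy()
    acc = law(pos, mass)
    traj = [pos.copy()]
    for _ in range(steps):
        vel_half = vel + 0.5 * dt * acc
        pos = pos + dt * vel_half
        acc = law(pos, mass)
        vel = vel_half + 0.5 * dt * acc
        traj.append(pos.copy())
    return np.stack(traj)


def trajectory_mse(pred, true):
    """Mean squared position error over all time steps, particles, and axes."""
    return float(np.mean((pred - true) ** 2))


# Hypothetical world: the hidden law uses power=1.5 instead of Newton's 2.0.
rng = np.random.default_rng(0)
pos0 = rng.normal(size=(4, 2))
vel0 = rng.normal(scale=0.1, size=(4, 2))
mass = np.ones(4)


def hidden_law(p, m):
    return accelerations(p, m, power=1.5)


def newtonian_guess(p, m):
    return accelerations(p, m, power=2.0)


truth = simulate(pos0, vel0, mass, hidden_law)
mse_correct = trajectory_mse(simulate(pos0, vel0, mass, hidden_law), truth)
mse_newton = trajectory_mse(simulate(pos0, vel0, mass, newtonian_guess), truth)
```

In this toy setting a submission that recovers the altered exponent reproduces the trajectories exactly, while a Newtonian guess accumulates error. The abstract's point that low MSE does not imply a good explanation is not captured here, since that axis depends on the rubric-based judgment rather than on the simulation.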
