LG | Apr 17
Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints
Xinge Liu, Terry Jingchen Zhang, Bernhard Schölkopf et al.
The rise of autonomous AI agents suggests that dynamic benchmark environments with built-in feedback on scientifically grounded tasks are needed to evaluate the capabilities of these agents in research work. We introduce Stargazer, a scalable environment for evaluating AI agents on dynamic, iterative physics-grounded model-fitting tasks using inference on radial-velocity (RV) time series data. Stargazer comprises 120 tasks across three difficulty tiers, including 20 real archival cases, covering diverse scenarios ranging from high-SNR single-planet systems to complex multi-planetary configurations requiring involved low-SNR analysis. Our evaluation of eight frontier agents reveals a gap between numerical optimization and adherence to physical constraints: although agents often achieve a good statistical fit, they frequently fail to recover correct physical system parameters, a limitation that persists even when agents are equipped with vanilla skills. Furthermore, increasing test-time compute yields only marginal gains, with excessive token usage often reflecting recursive failure loops rather than meaningful exploration. Stargazer presents an opportunity to train, evaluate, scaffold, and scale strategies on a model-fitting problem of practical research relevance today. Our methodology to design a simulation-driven environment for AI agents presumably generalizes to many other model-fitting problems across scientific domains. Source code and the project website are available at https://github.com/Gudmorning2025/Stargazer and https://gudmorning2025.github.io/Stargazer, respectively.
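To make the radial-velocity model-fitting task concrete, here is a minimal sketch of a single-planet Keplerian RV model and a reduced chi-square score. It is an illustration under stated assumptions, not Stargazer's actual evaluation code: the function names (`solve_kepler`, `rv_model`, `reduced_chi2`), the parameterization, and the synthetic data are all hypothetical, and the Newton solver is only intended for moderate eccentricities.

```python
import numpy as np

def solve_kepler(M, e, n_iter=50):
    """Solve Kepler's equation E - e*sin(E) = M by Newton iteration.

    Assumption: moderate eccentricity, where starting from E = M converges.
    """
    M = np.asarray(M, dtype=float)
    E = M.copy()
    for _ in range(n_iter):
        E = E - (E - e * np.sin(E) - M) / (1.0 - e * np.cos(E))
    return E

def rv_model(t, P, K, e, omega, t_peri, gamma=0.0):
    """Radial velocity of a star hosting one planet.

    P: period, K: semi-amplitude, e: eccentricity, omega: argument of
    periastron (rad), t_peri: time of periastron, gamma: systemic offset.
    """
    M = 2.0 * np.pi * (np.asarray(t, dtype=float) - t_peri) / P  # mean anomaly
    E = solve_kepler(M, e)                                       # eccentric anomaly
    nu = 2.0 * np.arctan2(np.sqrt(1.0 + e) * np.sin(E / 2.0),
                          np.sqrt(1.0 - e) * np.cos(E / 2.0))    # true anomaly
    return K * (np.cos(nu + omega) + e * np.cos(omega)) + gamma

def reduced_chi2(t, rv, sigma, params, n_free=6):
    """Reduced chi-square of the model against data with uncertainties sigma."""
    resid = (rv - rv_model(t, *params)) / sigma
    return np.sum(resid ** 2) / (len(t) - n_free)

# Hypothetical synthetic data set: one planet plus Gaussian noise.
rng = np.random.default_rng(0)
t_obs = np.sort(rng.uniform(0.0, 100.0, 40))
true_params = (12.5, 8.0, 0.2, 0.7, 3.0, 1.5)  # P, K, e, omega, t_peri, gamma
sigma_obs = np.full_like(t_obs, 1.0)
rv_obs = rv_model(t_obs, *true_params) + rng.normal(0.0, sigma_obs)

# Score the true parameters and a candidate with a different period.
chi2_true = reduced_chi2(t_obs, rv_obs, sigma_obs, true_params)
chi2_other = reduced_chi2(t_obs, rv_obs, sigma_obs, (25.0, 8.0, 0.2, 0.7, 3.0, 1.5))
print(f"reduced chi2 (true params): {chi2_true:.2f}")
print(f"reduced chi2 (other period): {chi2_other:.2f}")
```

A goodness-of-fit score such as this one measures only how well the curve matches the data; whether the fitted parameters match the true physical system is a separate check, which is the gap the abstract describes.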
EP | Jun 20, 2019
Automated crater shape retrieval using weakly-supervised deep learning
Mohamad Ali-Dib, Kristen Menou, Alan P. Jackson et al.
Crater ellipticity determination is a complex and time-consuming task that so far has evaded successful automation. We train a state-of-the-art computer vision algorithm to identify craters in Lunar digital elevation maps and retrieve their sizes and 2D shapes. The computational backbone of the model is MaskRCNN, an "instance segmentation" general framework that detects craters in an image while simultaneously producing a mask for each crater that traces its outer rim. Our post-processing pipeline then finds the closest fitting ellipse to these masks, allowing us to retrieve the crater ellipticities. Our model is able to correctly identify 87% of known craters in the longitude range we hid from the network during training and validation (test set), while predicting thousands of additional craters not present in our training data. Manual validation of a subset of these "new" craters indicates that a majority of them are real, which we take as an indicator of the strength of our model in learning to identify craters, despite incomplete training data. The crater size, ellipticity, and depth distributions predicted by our model are consistent with human-generated results. The model allows us to perform a large-scale search for differences in crater diameter and shape distributions between the lunar highlands and maria, and we exclude any such differences with a high statistical significance. The predicted test set catalogue and trained model are available here: https://github.com/malidib/Craters_MaskRCNN/.
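The mask-to-ellipse step can be illustrated with a short sketch. The abstract does not say how the closest fitting ellipse is computed, so the version below swaps in a simple moment-based estimate: for a filled ellipse, the covariance of the pixel coordinates has eigenvalues a²/4 and b²/4. The function names and the definition of ellipticity as the major-to-minor axis ratio are assumptions made for illustration, not the authors' pipeline.

```python
import numpy as np

def ellipse_from_mask(mask):
    """Estimate ellipse parameters from a filled binary mask via second moments.

    Returns (center_xy, semi_major, semi_minor, angle), with angle in radians
    measured from the x-axis and defined only modulo pi.
    """
    ys, xs = np.nonzero(mask)
    coords = np.stack([xs, ys]).astype(float)   # shape (2, n_pixels)
    center = coords.mean(axis=1)
    cov = np.cov(coords)                        # 2x2 covariance of pixel positions
    evals, evecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    semi_minor, semi_major = 2.0 * np.sqrt(evals)
    major_dir = evecs[:, 1]                     # eigenvector of the largest eigenvalue
    angle = np.arctan2(major_dir[1], major_dir[0])
    return center, semi_major, semi_minor, angle

def synthetic_mask(shape, center, a, b, angle):
    """Hypothetical test helper: rasterize a filled, rotated ellipse."""
    yy, xx = np.mgrid[:shape[0], :shape[1]]
    dx, dy = xx - center[0], yy - center[1]
    u = dx * np.cos(angle) + dy * np.sin(angle)    # coordinate along the major axis
    v = -dx * np.sin(angle) + dy * np.cos(angle)   # coordinate along the minor axis
    return (u / a) ** 2 + (v / b) ** 2 <= 1.0

mask = synthetic_mask((200, 200), center=(100.0, 90.0), a=40.0, b=20.0, angle=0.3)
center, a_fit, b_fit, angle_fit = ellipse_from_mask(mask)
ellipticity = a_fit / b_fit   # assumed definition: major-to-minor axis ratio
print(center, a_fit, b_fit, angle_fit, ellipticity)
```

The moment-based estimate assumes a filled, roughly elliptical mask; a rim-only mask or a heavily clipped crater would call for a different fitting approach.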
AI | Jan 30, 2025
Gravity-Bench-v1: A Benchmark on Gravitational Physics Discovery for Agents
Nolan Koblischke, Hyunseok Jang, Kristen Menou et al.
Modern science emerged from reasoning over repeatedly-observed planetary motions. We present Gravity-Bench-v1, an environment-based benchmark that challenges AI agents on tasks that parallel this historical development. Gravity-Bench-v1 evaluates agents on the discovery of physics concealed within a dynamic environment, using rigorous gravitational dynamics simulations. Gravity-Bench includes out-of-distribution cases, i.e. with physics that deviates from the real world, to evaluate true scientific generalization capabilities. Agents must plan to collect data within an experimental budget and must perform a dynamic form of data analysis and reasoning to solve tasks efficiently. Our benchmark admits an open-ended space of solutions. Reference solutions for each task are provided to calibrate AI performance against human expertise. Technically at an upper-undergraduate level, our benchmark proves challenging to baseline AI agents. Gravity-Bench-v1 and planned extensions should help map out AI progress towards scientific discovery capabilities.
AI | Dec 4, 2023
Physics simulation capabilities of LLMs
Mohamad Ali-Dib, Kristen Menou
[Abridged abstract] Large Language Models (LLMs) can solve some undergraduate-level to graduate-level physics textbook problems and are proficient at coding. Combining these two capabilities could one day enable AI systems to simulate and predict the physical world. We present an evaluation of state-of-the-art (SOTA) LLMs on PhD-level to research-level computational physics problems. We condition LLM generation on the use of well-documented and widely-used packages to elicit coding capabilities in the physics and astrophysics domains. We contribute ~50 original and challenging problems in celestial mechanics (with REBOUND), stellar physics (with MESA), 1D fluid dynamics (with Dedalus) and non-linear dynamics (with SciPy). Since our problems do not admit unique solutions, we evaluate LLM performance on several soft metrics: counts of lines that contain different types of errors (coding, physics, necessity and sufficiency) as well as a more "educational" Pass-Fail metric focused on capturing the salient physical ingredients of the problem at hand. As expected, today's SOTA LLM (GPT4) zero-shot fails most of our problems, although about 40% of the solutions could plausibly get a passing grade. About 70-90% of the code lines produced are necessary, sufficient and correct (coding & physics). Physics and coding errors are the most common, with some unnecessary or insufficient lines. We observe significant variations across problem class and difficulty. We identify several failure modes of GPT4 in the computational physics domain. Our reconnaissance work provides a snapshot of current computational capabilities in classical physics and points to obvious improvement targets if AI systems are ever to reach a basic level of autonomy in physics simulation capabilities.