SymPyBench: A Dynamic Benchmark for Scientific Reasoning with Executable Python Code
SymPyBench gives researchers in AI and scientific reasoning a dynamic benchmark for assessing and improving model robustness and interpretability, though the contribution is incremental because it builds on existing benchmarking approaches.
To address the challenge of evaluating scientific reasoning in AI systems, the authors introduce SymPyBench, a large-scale synthetic benchmark of 15,045 university-level physics problems with executable Python code, together with three novel evaluation metrics that reveal strengths and limitations in state-of-the-art models.
We introduce SymPyBench, a large-scale synthetic benchmark of 15,045 university-level physics problems (90%/10% train/test split). Each problem is fully parameterized, supporting an effectively infinite range of input configurations, and is accompanied by structured, step-by-step reasoning and executable Python code that produces the ground-truth solution for any parameter set. The benchmark contains three question types: MC-Symbolic (multiple-choice with symbolic options), MC-Numerical (multiple-choice with numerical options), and free-form (open-ended responses). These diverse formats test complementary reasoning skills. By leveraging the dynamic, code-driven nature of the benchmark, we introduce three novel evaluation metrics in addition to standard accuracy (Consistency Score, Failure Rate, and Confusion Rate) that quantify variability and uncertainty across problem variants. Experiments with state-of-the-art instruction-tuned language models reveal both strengths and limitations in scientific reasoning, positioning SymPyBench as a foundation for developing more robust and interpretable reasoning systems.
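To make the idea of a parameterized problem with executable ground truth concrete, the following is a minimal sketch and not code from SymPyBench. The kinematics template, the function names (`final_velocity`, `sample_variant`, `variant_accuracy`), the parameter ranges, and the 1% tolerance are all assumptions chosen for illustration. The sketch only computes plain accuracy across variants of one problem; it does not implement the Consistency Score, Failure Rate, or Confusion Rate, whose definitions are not given here.

```python
import math
import random

# Hypothetical problem template (not taken from SymPyBench):
# "An object starts at v0 m/s and accelerates at a m/s^2 for t s.
#  What is its final speed?"
def final_velocity(v0: float, a: float, t: float) -> float:
    """Executable ground truth: valid for any parameter set."""
    return v0 + a * t

def sample_variant(rng: random.Random) -> tuple[dict, float]:
    """Draw one parameter configuration and its ground-truth answer.
    The ranges below are illustrative assumptions."""
    params = {
        "v0": rng.uniform(0.0, 20.0),
        "a": rng.uniform(0.5, 5.0),
        "t": rng.uniform(1.0, 10.0),
    }
    return params, final_velocity(**params)

def is_correct(pred: float, truth: float, rel_tol: float = 1e-2) -> bool:
    """Numeric match within an assumed 1% relative tolerance."""
    return math.isclose(pred, truth, rel_tol=rel_tol)

def variant_accuracy(model, n_variants: int = 5, seed: int = 0) -> float:
    """Fraction of sampled variants of the same problem that `model`
    answers correctly. `model` maps a parameter dict to a number."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_variants):
        params, truth = sample_variant(rng)
        outcomes.append(is_correct(model(params), truth))
    return sum(outcomes) / len(outcomes)

# Two stand-in "models": one applies the formula, one ignores acceleration.
def good_model(p: dict) -> float:
    return p["v0"] + p["a"] * p["t"]

def flawed_model(p: dict) -> float:
    return p["v0"]

if __name__ == "__main__":
    print(variant_accuracy(good_model))
    print(variant_accuracy(flawed_model))
```

Because the ground truth is a function rather than a stored number, the same problem can be re-sampled as often as needed, which is what makes per-problem variability across variants measurable in the first place.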