AI · Mar 26, 2025

PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving

arXiv:2503.21821v133 citations · h-index: 28 · ACL
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for robust evaluation of AI models on advanced physics problem solving. It is incremental, however, in that it applies existing benchmarking methods to a new domain.

The researchers evaluate foundation models on university-level physics by creating PHYSICS, a benchmark of 1,297 expert-annotated problems across six core physics areas. Even the most advanced model tested (o3-mini) achieved only 59.9% accuracy, revealing substantial limitations in solving high-level scientific problems.

We introduce PHYSICS, a comprehensive benchmark for university-level physics problem solving. It contains 1297 expert-annotated problems covering six core areas: classical mechanics, quantum mechanics, thermodynamics and statistical mechanics, electromagnetism, atomic physics, and optics. Each problem requires advanced physics knowledge and mathematical reasoning. We develop a robust automated evaluation system for precise and reliable validation. Our evaluation of leading foundation models reveals substantial limitations. Even the most advanced model, o3-mini, achieves only 59.9% accuracy, highlighting significant challenges in solving high-level scientific problems. Through comprehensive error analysis, exploration of diverse prompting strategies, and Retrieval-Augmented Generation (RAG)-based knowledge augmentation, we identify key areas for improvement, laying the foundation for future advancements.
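As a rough illustration of what an accuracy figure such as 59.9% measures, the sketch below scores a set of numeric answers against reference answers. It is a minimal example under stated assumptions, not the paper's evaluation system: the function names `answers_match` and `accuracy`, the purely numeric answer format, and the 1% relative tolerance are all hypothetical choices made for this example.

```python
import math
from typing import Sequence


def answers_match(predicted: float, reference: float, rel_tol: float = 1e-2) -> bool:
    """Hypothetical check: treat two numeric answers as equal if they agree
    within a relative tolerance (assumed here to be 1%)."""
    return math.isclose(predicted, reference, rel_tol=rel_tol)


def accuracy(predictions: Sequence[float], references: Sequence[float],
             rel_tol: float = 1e-2) -> float:
    """Fraction of problems whose predicted answer matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        answers_match(p, r, rel_tol) for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Illustrative values only; these are not results from the benchmark.
example_score = accuracy([9.81, 3.0, 1.0], [9.8, 3.5, 1.0])
```

A real grader for physics problems would also need to handle symbolic expressions and units, which this sketch deliberately leaves out.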

