AI · CL · Jul 18, 2024

SciCode: A Research Coding Benchmark Curated by Scientists

Princeton · UW
arXiv:2407.13168v1 · 115 citations · h-index: 25
Originality: Incremental advance
AI Analysis

This paper addresses the problem of developing high-quality evaluations for AI researchers and scientists by providing a challenging benchmark for assessing language models' capabilities in scientific coding. The advance is incremental, since it builds on existing evaluation frameworks.

The authors tackled the challenge of evaluating language models on realistic scientific coding tasks by creating SciCode, a benchmark curated by scientists across 16 natural science subfields, containing 338 subproblems from 80 main problems. The best-performing model, Claude3.5-Sonnet, solved only 4.6% of problems in the most realistic setting.

Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' capabilities to generate code for solving real scientific research problems. Incorporating input from scientists and AI researchers in 16 diverse natural science sub-fields, including mathematics, physics, chemistry, biology, and materials science, we created a scientist-curated coding benchmark, SciCode. The problems in SciCode naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems. It offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. We believe that SciCode both demonstrates contemporary LMs' progress towards becoming helpful scientific assistants and sheds light on the development and evaluation of scientific AI in the future.

