AI MTRL-SCI CL Jun 4, 2025

Matter-of-Fact: A Benchmark for Verifying the Feasibility of Literature-Supported Claims in Materials Science

Peter Jansen, Samiah Hassan, Ruoyao Wang

arXiv:2506.04410v2 14.74 citations h-index: 21 Has Code EMNLP

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of scaling automated scientific discovery systems by improving hypothesis filtering. Its contribution is incremental, since it offers a benchmark rather than a new method.

The authors address the problem of filtering automatically generated scientific hypotheses for feasibility, which would reduce the number of costly experiments. They introduce Matter-of-Fact, a benchmark dataset of 8.4k claims from materials science articles, and show that current models do not exceed 72% on this task, against a chance level of 50%.

Contemporary approaches to assisted scientific discovery use language models to automatically generate large numbers of potential hypotheses to test, while also automatically generating code-based experiments to test those hypotheses. While hypotheses can be comparatively inexpensive to generate, automated experiments can be costly, particularly when run at scale (i.e., thousands of experiments). Developing the capacity to filter hypotheses based on their feasibility would allow discovery systems to run at scale, while increasing their likelihood of making significant discoveries. In this work we introduce Matter-of-Fact, a challenge dataset for determining the feasibility of hypotheses framed as claims, while operationalizing feasibility assessment as a temporally-filtered claim verification task using backtesting. Matter-of-Fact includes 8.4k claims extracted from scientific articles spanning four high-impact contemporary materials science topics, including superconductors, semiconductors, batteries, and aerospace materials, while including qualitative and quantitative claims from theoretical, experimental, and code/simulation results. We show that strong baselines that include retrieval-augmented generation over scientific literature and code generation fail to exceed 72% performance on this task (chance performance is 50%), while domain-expert verification suggests nearly all are solvable -- highlighting both the difficulty of this task for current models, and the potential to accelerate scientific discovery by making near-term progress.
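
The temporal filtering described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The record types, field names, and the strict "published before" cutoff are assumptions made for illustration, and the sample data are placeholders. The sketch also assumes binary labels with balanced classes, which the stated 50% chance level suggests.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    # Hypothetical record for a retrievable article; not the benchmark's schema.
    title: str
    published: date


@dataclass(frozen=True)
class Claim:
    # Hypothetical record for a claim with a binary feasibility label.
    text: str
    source_published: date  # publication date of the article the claim came from
    feasible: bool


def visible_papers(claim, corpus):
    """Keep only papers published strictly before the claim's source article.

    The strict cutoff is an assumption; it stands in for whatever temporal
    filter a backtesting setup would apply to avoid leaking the answer.
    """
    return [p for p in corpus if p.published < claim.source_published]


def accuracy(predictions, claims):
    """Fraction of claims whose predicted label matches the gold label."""
    correct = sum(pred == c.feasible for pred, c in zip(predictions, claims))
    return correct / len(claims)


# Placeholder data, invented for this example only.
corpus = [
    Paper("Earlier study", date(2022, 5, 1)),
    Paper("Later study", date(2024, 3, 1)),
]
claims = [
    Claim("Placeholder claim A", date(2023, 1, 15), True),
    Claim("Placeholder claim B", date(2023, 6, 1), False),
]

# Only the 2022 paper is visible when judging a claim sourced from 2023.
allowed = visible_papers(claims[0], corpus)

# A baseline that always answers "feasible" scores at chance on a balanced set.
always_feasible = [True for _ in claims]
baseline_accuracy = accuracy(always_feasible, claims)
```

In a retrieval-augmented baseline, a filter like `visible_papers` would run before retrieval so that a model judging a claim never sees the article that reports its outcome.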
