CE · Jun 1

Matter to Mechanism: A Benchmark for AI Co-Scientists in Materials and Battery Research

Shashwat Sourav, Tanjin He, Maria K. Y. Chan, Anubhav Jain, Tirthankar Ghosal

arXiv:2606.02258

AI Analysis

For researchers developing AI co-scientists in materials science, this benchmark provides a more rigorous evaluation of hypothesis generation, addressing a gap in current benchmarks that focus on data analysis or literature summarization.

The paper introduces Matter to Mechanism, a benchmark for evaluating AI co-scientists on problem-to-hypothesis reasoning in materials and battery research, containing 2,645 instances with structured annotations and a composite metric suite. Evaluation of several AI systems reveals interpretable differences not captured by standard text-similarity metrics, and the composite score is more robust to adversarial attacks than individual metrics.

AI co-scientists are increasingly used for scientific discovery, but current evaluations still do not test them on a key task: moving from a concrete scientific or technological problem to a plausible, mechanism-grounded solution hypothesis. This gap is especially important in materials science and, in particular, battery research, where a useful proposal must identify the relevant failure mode, propose a credible intervention, and explain why that intervention should improve the target property. We introduce Matter to Mechanism, a benchmark for evaluating AI co-scientists on problem-to-hypothesis reasoning in materials science, with a focus on battery materials research. The benchmark contains 2,645 instances derived from scientific publications. Each instance includes a structured problem statement, a candidate solution hypothesis, an explicit reasoning trace, and domain-grounded annotations such as material system, component, failure mode, intervention, mechanism, target property, and claimed outcome. We also introduce a metric suite that measures reasoning fidelity, problem alignment, mechanistic specificity, novelty, plausibility, and problem decomposition quality, and combine them into a composite score. Using this framework, we evaluate several AI co-scientist systems and show that Matter to Mechanism reveals interpretable system differences that are only partially recovered by standard text-similarity metrics. We further show through adversarial stress tests that the aggregate score is more stable than individual metric dimensions under superficial gaming attacks.
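The instance structure and composite score described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the field names are paraphrased from the abstract, the assumption that each metric is scored in [0, 1] is ours, and the unweighted mean is a placeholder because the abstract does not say how the six dimensions are combined.

```python
from dataclasses import dataclass

# Metric dimensions named in the abstract; the identifiers are our own.
METRIC_DIMENSIONS = (
    "reasoning_fidelity",
    "problem_alignment",
    "mechanistic_specificity",
    "novelty",
    "plausibility",
    "decomposition_quality",
)


@dataclass
class Instance:
    """One benchmark instance (hypothetical schema based on the abstract)."""
    problem_statement: str
    solution_hypothesis: str
    reasoning_trace: str
    # Domain-grounded annotations listed in the abstract.
    material_system: str
    component: str
    failure_mode: str
    intervention: str
    mechanism: str
    target_property: str
    claimed_outcome: str


def composite_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores into one number.

    Assumption: an unweighted mean over scores in [0, 1]. The actual
    aggregation used by the benchmark may differ.
    """
    missing = [d for d in METRIC_DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing metric dimensions: {missing}")
    return sum(scores[d] for d in METRIC_DIMENSIONS) / len(METRIC_DIMENSIONS)
```

Under this placeholder aggregation, a response that inflates a single dimension moves the composite by at most one sixth of that dimension's change, which gives some intuition for why an aggregate can be more stable than any individual metric under superficial gaming.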
