Mechanisms of Matter: Language Inferential Benchmark on Physicochemical Hypothesis in Materials Synthesis
This work addresses the need for benchmarks that quantify LLMs' ability to generate scientific hypotheses in materials science, though its contribution is incremental because it builds on existing prompting methods.
The authors evaluate the ability of Large Language Models to generate valid scientific hypotheses for materials synthesis by introducing MatterMech, a benchmark spanning eight nanomaterial synthesis domains. They find that their principle-aware prompting method substantially outperforms standard Chain-of-Thought in both hypothesis accuracy and computational efficiency.
The capacity of Large Language Models (LLMs) to generate valid scientific hypotheses for materials synthesis remains largely unquantified, hindered by the absence of benchmarks that probe physicochemical reasoning. To address this, we introduce MatterMech, a benchmark for evaluating LLM-generated hypotheses across eight nanomaterial synthesis domains. Our analysis reveals a critical disconnect: LLMs are proficient in abstract logic yet fail to ground their reasoning in fundamental physicochemical principles. We demonstrate that our proposed principle-aware prompting methodology substantially outperforms standard Chain-of-Thought, improving both hypothesis accuracy and computational efficiency. This work provides a methodological framework to advance LLMs toward reliable scientific hypothesis generation in materials science. The MatterMech benchmark and associated code are publicly available at \href{https://github.com/amair-lab/MatterMech}{GitHub}.