MSQA: Benchmarking LLMs on Graduate-Level Materials Science Reasoning and Knowledge
This work addresses the problem of evaluating domain-specific knowledge and complex reasoning in LLMs for materials science researchers. It is an incremental contribution that takes the form of a new benchmark.
The paper tackles the lack of benchmarks for evaluating large language models (LLMs) in materials science by introducing MSQA, a comprehensive benchmark of 1,757 graduate-level questions, and finds significant performance gaps, with proprietary LLMs achieving up to 84.5% accuracy and open-source LLMs peaking around 60.5%.
Despite recent advances in large language models (LLMs) for materials science, there is a lack of benchmarks for evaluating their domain-specific knowledge and complex reasoning abilities. To bridge this gap, we introduce MSQA, a comprehensive evaluation benchmark of 1,757 graduate-level materials science questions in two formats: detailed explanatory responses and binary True/False assessments. MSQA challenges LLMs by requiring both precise factual knowledge and multi-step reasoning across seven materials science sub-fields, such as structure-property relationships, synthesis processes, and computational modeling. Through experiments with 10 state-of-the-art LLMs, we identify significant gaps in current LLM performance. While API-based proprietary LLMs achieve up to 84.5% accuracy, open-source (OSS) LLMs peak around 60.5%, and domain-specific LLMs often underperform significantly due to overfitting and distributional shifts. MSQA is the first benchmark to jointly evaluate the factual knowledge and reasoning capabilities that LLMs need for advanced materials science.
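To make the binary True/False format concrete, the following is a minimal sketch of how accuracy on such questions could be computed. It is not the authors' evaluation code: the function names `parse_true_false` and `binary_accuracy`, the rule of taking the first standalone "true" or "false" in a response, and the choice to count unparseable responses as incorrect are all assumptions made for illustration.

```python
import re
from typing import Optional, Sequence


def parse_true_false(response: str) -> Optional[bool]:
    """Return the first standalone 'true'/'false' in a response, or None.

    Hypothetical parsing rule, assumed for illustration only.
    """
    match = re.search(r"\b(true|false)\b", response, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "true"


def binary_accuracy(responses: Sequence[str], gold: Sequence[bool]) -> float:
    """Fraction of responses whose parsed label matches the gold label.

    Responses with no parseable label are counted as incorrect (an assumption).
    """
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        raise ValueError("at least one question is required")
    correct = sum(parse_true_false(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


if __name__ == "__main__":
    # Toy inputs, not MSQA data.
    model_outputs = ["True, because the phase is metastable.", "false", "I am not sure."]
    gold_labels = [True, True, False]
    print(f"accuracy = {binary_accuracy(model_outputs, gold_labels):.3f}")
```

Scoring the detailed explanatory format would need a different procedure, such as a rubric or a judge, which this sketch does not attempt to cover.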