MaterialBENCH: Evaluating College-Level Materials Science Problem-Solving Abilities of Large Language Models
This work provides a domain-specific benchmark for assessing LLMs in materials science. The contribution is incremental, as it adapts existing evaluation methods to a new field.
The authors address the problem of evaluating the problem-solving abilities of large language models (LLMs) in materials science by constructing MaterialBENCH, a college-level benchmark dataset based on university textbooks. They find performance differences among models such as ChatGPT-3.5, ChatGPT-4, and Bard, and they analyze the effects of answer format and system messages.
We construct MaterialBENCH, a college-level benchmark dataset for large language models (LLMs) in the materials science field. The dataset consists of problem-answer pairs based on university textbooks. There are two types of problems: free-response and multiple-choice. Each multiple-choice problem is constructed by adding three incorrect answers to the correct answer, so that the LLM chooses one of four options as its response. Most of the free-response and multiple-choice problems overlap, differing only in the format of the answers. We also conduct experiments with MaterialBENCH on several LLMs, including ChatGPT-3.5, ChatGPT-4, Bard (at the time of the experiments), and GPT-3.5 and GPT-4 accessed through the OpenAI API. We analyze and discuss the differences and similarities in LLM performance as measured by MaterialBENCH. We also study the performance differences between the free-response and multiple-choice types within the same model, and the influence of system messages on multiple-choice problems. We anticipate that MaterialBENCH will encourage further development of the reasoning abilities of LLMs, enabling them to solve more complicated problems and eventually contribute to materials research and discovery.
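The multiple-choice construction described above (one correct answer plus three incorrect answers, presented as four options) can be illustrated with a minimal Python sketch. This is not the authors' code: the names `MultipleChoiceItem`, `build_multiple_choice`, `format_prompt`, and `accuracy` are hypothetical, the shuffling of options and the A-D labels are assumptions, and the sample question is made up for illustration rather than taken from MaterialBENCH.

```python
import random
from dataclasses import dataclass

# Option labels are an assumption; the abstract only states that there are four choices.
LABELS = ["A", "B", "C", "D"]


@dataclass
class MultipleChoiceItem:
    question: str
    choices: list[str]   # four options in presentation order
    correct_label: str   # label of the correct option


def build_multiple_choice(question, correct, distractors, rng):
    """Turn a free-response problem into a four-option multiple-choice item."""
    if len(distractors) != 3:
        raise ValueError("exactly three incorrect answers are required")
    if correct in distractors:
        raise ValueError("incorrect answers must differ from the correct answer")
    options = [correct, *distractors]
    rng.shuffle(options)  # assumed: randomize the position of the correct answer
    return MultipleChoiceItem(question, options, LABELS[options.index(correct)])


def format_prompt(item):
    """Render the question followed by its labeled options, one per line."""
    lines = [item.question]
    lines += [f"{label}. {choice}" for label, choice in zip(LABELS, item.choices)]
    return "\n".join(lines)


def accuracy(items, responses):
    """Fraction of items whose response label matches the correct label."""
    if not items or len(items) != len(responses):
        raise ValueError("items and responses must be non-empty and equal in length")
    hits = sum(r.strip().upper() == it.correct_label for it, r in zip(items, responses))
    return hits / len(items)


# Illustrative problem only; not an item from the dataset.
item = build_multiple_choice(
    "What is the atomic packing factor of an FCC crystal?",
    correct="0.74",
    distractors=["0.68", "0.52", "0.34"],
    rng=random.Random(0),
)
print(format_prompt(item))
```

Because the free-response and multiple-choice versions share the same question and correct answer, a structure like this makes it straightforward to score the same model on both formats and compare the results.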