CL, Jun 13, 2024

SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models

Kehua Feng, Xinyi Shen, Weijie Wang, Xiang Zhuang, Yuqi Tang, Qiang Zhang, Keyan Ding

arXiv:2406.09098v4 14.912 citations Has Code

Originality: Synthesis-oriented

AI Analysis

SciKnowEval provides a standard benchmark for assessing the scientific capabilities of LLMs, filling a gap for researchers and developers, though the contribution is incremental because it builds on existing evaluation frameworks.

The authors address the lack of comprehensive benchmarks for evaluating scientific knowledge in large language models by introducing SciKnowEval, a dataset of 28K multi-level questions across four sciences: biology, chemistry, physics, and materials science. They find that proprietary models lead, but significant challenges persist in scientific reasoning and real-world application.

Large language models (LLMs) are playing an increasingly important role in scientific research, yet there remains a lack of comprehensive benchmarks to evaluate the breadth and depth of scientific knowledge embedded in these models. To address this gap, we introduce SciKnowEval, a large-scale dataset designed to systematically assess LLMs across five progressive levels of scientific understanding: memory, comprehension, reasoning, discernment, and application. SciKnowEval comprises 28K multi-level questions and solutions spanning biology, chemistry, physics, and materials science. Using this benchmark, we evaluate 20 leading open-source and proprietary LLMs. The results show that while proprietary models often achieve state-of-the-art performance, substantial challenges remain -- particularly in scientific reasoning and real-world application. We envision SciKnowEval as a standard benchmark for evaluating scientific capabilities in LLMs and as a catalyst for advancing more capable and reliable scientific language models.
