CL · Jun 13, 2024

SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models

arXiv:2406.09098v4 · 12 citations · Has Code
AI Analysis

SciKnowEval offers a standard benchmark for assessing the scientific capabilities of LLMs, filling a gap for researchers and developers, though the contribution is incremental because it builds on existing evaluation frameworks.

The authors address the lack of comprehensive benchmarks for evaluating scientific knowledge in large language models by introducing SciKnowEval, a dataset of 28K multi-level questions across four scientific fields. They find that proprietary models lead, but significant challenges persist in reasoning and application.

Large language models (LLMs) are playing an increasingly important role in scientific research, yet there remains a lack of comprehensive benchmarks to evaluate the breadth and depth of scientific knowledge embedded in these models. To address this gap, we introduce SciKnowEval, a large-scale dataset designed to systematically assess LLMs across five progressive levels of scientific understanding: memory, comprehension, reasoning, discernment, and application. SciKnowEval comprises 28K multi-level questions and solutions spanning biology, chemistry, physics, and materials science. Using this benchmark, we evaluate 20 leading open-source and proprietary LLMs. The results show that while proprietary models often achieve state-of-the-art performance, substantial challenges remain -- particularly in scientific reasoning and real-world application. We envision SciKnowEval as a standard benchmark for evaluating scientific capabilities in LLMs and as a catalyst for advancing more capable and reliable scientific language models.

Code Implementations: 1 repo