CL | Aug 25, 2023

SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research

arXiv:2308.13149v2 | 156 citations | h-index: 23 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for better evaluation benchmarks for LLMs in scientific research, though it is incremental as it builds on existing benchmark designs.

The authors address the evaluation of Large Language Models (LLMs) for scientific research by proposing SciEval, a multi-disciplinary benchmark that includes both objective and subjective questions. The design targets two weaknesses of earlier benchmarks: data leakage and the lack of subjective evaluation. Experiments show that GPT-4 achieves state-of-the-art performance but leaves substantial room for improvement, especially on dynamic questions.

Recently, there has been growing interest in using Large Language Models (LLMs) for scientific research. Numerous benchmarks have been proposed to evaluate the ability of LLMs in scientific research. However, current benchmarks are mostly based on pre-collected objective questions. This design suffers from the data leakage problem and does not evaluate subjective Q/A ability. In this paper, we propose SciEval, a comprehensive and multi-disciplinary evaluation benchmark to address these issues. Based on Bloom's taxonomy, SciEval covers four dimensions to systematically evaluate scientific research ability. In particular, we design a "dynamic" subset based on scientific principles to protect the evaluation from potential data leakage. Both objective and subjective questions are included in SciEval. These characteristics make SciEval a more effective benchmark for evaluating the scientific research ability of LLMs. Comprehensive experiments on the most advanced LLMs show that, although GPT-4 achieves SOTA performance compared to other LLMs, there is still substantial room for improvement, especially on dynamic questions. The code and data are publicly available at https://github.com/OpenDFM/SciEval.
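
The "dynamic" subset idea in the abstract can be illustrated with a minimal sketch. This is not SciEval's actual generation code: the template (the kinematics relation v = u + a*t), the function names `make_dynamic_question` and `is_correct`, and the value ranges are all assumptions chosen for illustration. The point is only that questions derived from a scientific principle with freshly sampled values cannot be answered by recalling a memorized question-answer pair.

```python
import random


def make_dynamic_question(rng):
    """Build one question from the principle v = u + a*t (hypothetical template)."""
    u = rng.randint(0, 20)   # initial velocity in m/s
    a = rng.randint(1, 10)   # constant acceleration in m/s^2
    t = rng.randint(1, 10)   # elapsed time in s
    answer = u + a * t       # ground truth computed from the principle itself
    question = (
        f"An object starts at {u} m/s and accelerates uniformly at "
        f"{a} m/s^2 for {t} s. What is its final velocity in m/s?"
    )
    return question, answer


def is_correct(model_output, answer, tol=1e-6):
    """Score a numeric reply; non-numeric replies count as wrong."""
    try:
        return abs(float(model_output.strip()) - answer) <= tol
    except ValueError:
        return False


if __name__ == "__main__":
    rng = random.Random()  # unseeded, so each evaluation run sees new values
    question, answer = make_dynamic_question(rng)
    print(question)
    print("reference answer:", answer)
```

Because the reference answer is computed at generation time, the question pool can be regenerated for every evaluation run, which is one plausible way to limit leakage from pre-collected data.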

Code Implementations: 2 repos
