CLAug 28, 2023

ZhuJiu: A Multi-dimensional, Multi-faceted Chinese Benchmark for Large Language Models

arXiv:2308.14353v1133 citationsh-index: 28
Originality Synthesis-oriented
AI Analysis

This provides a systematic evaluation tool for researchers and developers working on Chinese and English LLMs, though it is incremental as it builds on existing benchmarking efforts.

The authors tackled the need for comprehensive evaluation of large language models by proposing the ZhuJiu benchmark, which assesses 10 mainstream LLMs across 7 ability dimensions and 51 tasks, with a focus on Chinese language capabilities and avoiding data leakage.

The unprecedented performance of large language models (LLMs) requires comprehensive and accurate evaluation. We argue that for LLMs evaluation, benchmarks need to be comprehensive and systematic. To this end, we propose the ZhuJiu benchmark, which has the following strengths: (1) Multi-dimensional ability coverage: We comprehensively evaluate LLMs across 7 ability dimensions covering 51 tasks. Especially, we also propose a new benchmark that focuses on knowledge ability of LLMs. (2) Multi-faceted evaluation methods collaboration: We use 3 different yet complementary evaluation methods to comprehensively evaluate LLMs, which can ensure the authority and accuracy of the evaluation results. (3) Comprehensive Chinese benchmark: ZhuJiu is the pioneering benchmark that fully assesses LLMs in Chinese, while also providing equally robust evaluation abilities in English. (4) Avoiding potential data leakage: To avoid data leakage, we construct evaluation data specifically for 37 tasks. We evaluate 10 current mainstream LLMs and conduct an in-depth discussion and analysis of their results. The ZhuJiu benchmark and open-participation leaderboard are publicly released at http://www.zhujiu-benchmark.com/ and we also provide a demo video at https://youtu.be/qypkJ89L1Ic.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes