CL Nov 30, 2023

ArcMMLU: A Library and Information Science Benchmark for Large Language Models

Shitou Zhang, Zuchao Li, Xingshen Liu, Liming Yang, Ping Wang

arXiv:2311.18658v10.91 citations h-index: 8 Has Code

Originality Synthesis-oriented

AI Analysis

The paper addresses the need for domain-specific evaluation in Chinese LIS, though its contribution is incremental: it adapts the existing MMLU/CMMLU format to a new domain.

The paper introduces ArcMMLU, a Chinese benchmark for evaluating large language models in Library & Information Science, finding that most models achieve over 50% accuracy but still have significant room for improvement.

In light of the rapidly evolving capabilities of large language models (LLMs), it becomes imperative to develop rigorous domain-specific evaluation benchmarks to accurately assess their capabilities. In response to this need, this paper introduces ArcMMLU, a specialized benchmark tailored for the Library & Information Science (LIS) domain in Chinese. This benchmark aims to measure the knowledge and reasoning capability of LLMs within four key sub-domains: Archival Science, Data Science, Library Science, and Information Science. Following the format of MMLU/CMMLU, we collected over 6,000 high-quality questions for the compilation of ArcMMLU. This extensive compilation can reflect the diverse nature of the LIS domain and offer a robust foundation for LLM evaluation. Our comprehensive evaluation reveals that while most mainstream LLMs achieve an average accuracy rate above 50% on ArcMMLU, there remains a notable performance gap, suggesting substantial headroom for refinement in LLM capabilities within the LIS domain. Further analysis explores the effectiveness of few-shot examples on model performance and highlights challenging questions where models consistently underperform, providing valuable insights for targeted improvements. ArcMMLU fills a critical gap in LLM evaluations within the Chinese LIS domain and paves the way for future development of LLMs tailored to this specialized area.
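The evaluation setup described in the abstract (MMLU-style multiple-choice questions, optional few-shot examples, accuracy reported across sub-domains) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the record fields (`subdomain`, `question`, `options`, `answer`), the four-option A-D layout, the English prompt wording, and the `predict` callback are all assumptions made for illustration, and the real benchmark questions are in Chinese.

```python
from collections import defaultdict

CHOICES = "ABCD"  # assumed four-option layout, as in MMLU-style benchmarks


def format_question(q, with_answer=False):
    """Render one question as text; include the gold answer for few-shot shots."""
    lines = [q["question"]]
    lines += [f"{label}. {text}" for label, text in zip(CHOICES, q["options"])]
    lines.append("Answer:" + (f" {q['answer']}" if with_answer else ""))
    return "\n".join(lines)


def build_prompt(target, shots=()):
    """Prepend solved examples (few-shot) before the unanswered target question."""
    parts = [format_question(s, with_answer=True) for s in shots]
    parts.append(format_question(target))
    return "\n\n".join(parts)


def accuracy_by_subdomain(questions, predict):
    """Fraction of questions per sub-domain where predict(q) matches the gold label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        total[q["subdomain"]] += 1
        if predict(q) == q["answer"]:
            correct[q["subdomain"]] += 1
    return {name: correct[name] / total[name] for name in total}


# Toy placeholder records, not taken from ArcMMLU.
toy_questions = [
    {"subdomain": "Archival Science", "question": "Q1",
     "options": ["a", "b", "c", "d"], "answer": "A"},
    {"subdomain": "Archival Science", "question": "Q2",
     "options": ["a", "b", "c", "d"], "answer": "B"},
    {"subdomain": "Library Science", "question": "Q3",
     "options": ["a", "b", "c", "d"], "answer": "A"},
]

# A trivial baseline that always answers "A" stands in for a real model call.
scores = accuracy_by_subdomain(toy_questions, lambda q: "A")
one_shot_prompt = build_prompt(toy_questions[1], shots=[toy_questions[0]])
```

In a real run, `predict` would send `build_prompt(...)` to a model and parse the returned letter. Comparing scores with and without `shots` would mirror the few-shot analysis the abstract mentions.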
