CL · Mar 6, 2024

CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models

arXiv:2403.03514v2 · 32 citations · h-index: 13 · Has Code · EMNLP
Originality: Synthesis-oriented
AI Analysis

This paper addresses the underdeveloped state of evaluation for Chinese long-context LLMs by offering a new benchmark for researchers and developers. The contribution is incremental, since it builds on existing evaluation concepts.

The authors address the lack of benchmarks for evaluating long-context large language models in Chinese by introducing CLongEval, a comprehensive benchmark with 7 tasks and 7,267 examples. They use it to assess 8 models (6 open-source and 2 commercial) and analyze which capabilities remain challenging in long-context settings.

Developing Large Language Models (LLMs) with robust long-context capabilities has been a recent research focus, resulting in the emergence of long-context LLMs proficient in Chinese. However, the evaluation of these models remains underdeveloped due to a lack of benchmarks. To address this gap, we present CLongEval, a comprehensive Chinese benchmark for evaluating long-context LLMs. CLongEval is characterized by three key features: (1) sufficient data volume, comprising 7 distinct tasks and 7,267 examples; (2) broad applicability, accommodating models with context window sizes from 1K to 100K; (3) high quality, with over 2,000 manually annotated question-answer pairs in addition to the automatically constructed labels. With CLongEval, we undertake a comprehensive assessment of 6 open-source long-context LLMs and 2 leading commercial counterparts that feature both long-context abilities and proficiency in Chinese. We also provide in-depth analysis based on the empirical results, aiming to shed light on the critical capabilities that present challenges in long-context settings. The dataset, evaluation scripts, and model outputs are released.

Code Implementations: 1 repo
