CL | Jul 20, 2023

L-Eval: Instituting Standardized Evaluation for Long Context Language Models

arXiv:2307.11088v3 | 256 citations | h-index: 66 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses inconsistent evaluation, a problem for researchers and developers working on long context language models. The advance is incremental because it builds on existing evaluation efforts.

The authors tackle the lack of standardized evaluation for long context language models by proposing L-Eval, a benchmark with 20 sub-tasks, 508 documents, and over 2,000 query-response pairs. They find that n-gram metrics correlate poorly with human judgment and advocate length-instruction-enhanced evaluation and LLM judges instead.

Recently, there has been growing interest in extending the context length of large language models (LLMs), aiming to effectively process long single-turn inputs or conversations with more extensive histories. While proprietary models such as GPT-4 and Claude can largely preserve their reasoning ability in an extended context, open-source models are still in the early stages of development. To bridge this gap, we propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs), addressing two key aspects: dataset construction and evaluation metrics. On the one hand, we build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs encompassing diverse question styles, domains, and input lengths (3k$\sim$200k tokens). On the other hand, we investigate the effectiveness of evaluation metrics for LCLMs. Results show that popular n-gram matching metrics generally do not correlate well with human judgment, and thus we strongly advocate for length-instruction-enhanced (LIE) evaluation and employing LLM judges. We conducted a comprehensive study of 4 popular commercial LLMs and 12 open-source counterparts using the L-Eval benchmark. Our empirical findings offer useful insights into the study of LCLMs and lay the groundwork for the development of more principled evaluation of these models.
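To make the metric concern concrete, the following minimal sketch shows how a unigram-overlap F1 score can penalize a correct but verbose answer, and how a length instruction might be appended to a query. It is an illustration only, not the L-Eval implementation: the function names `token_f1` and `add_length_instruction`, the whitespace tokenization, the wording of the length hint, and the example strings are all assumptions made for this sketch.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 over lowercased, whitespace-split tokens (illustrative)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def add_length_instruction(question: str, reference: str) -> str:
    """Append a hypothetical length hint based on the reference word count."""
    n_words = len(reference.split())
    return f"{question} Answer in about {n_words} words."


# Made-up example data, not taken from the benchmark.
reference = "The treaty was signed in 1648"
concise = "The treaty was signed in 1648"
verbose = (
    "Based on the document provided, the treaty was signed in 1648 "
    "after long negotiations between the parties involved"
)

concise_score = token_f1(concise, reference)  # exact match
verbose_score = token_f1(verbose, reference)  # same fact, lower score
prompt = add_length_instruction("When was the treaty signed?", reference)
```

In this toy case both answers contain the same fact, yet the verbose one scores lower purely because of its length. A length hint in the prompt is one way to reduce that gap before an n-gram metric is applied.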

Code Implementations: 3 repos
