CL · Oct 22, 2024

ETHIC: Evaluating Large Language Models on Long-Context Tasks with High Information Coverage

arXiv:2410.16848v2 · 14 citations · h-index: 12 · Has Code · NAACL
Originality: Incremental advance
AI Analysis

This paper addresses the need for reliable evaluation benchmarks for long-context LLMs, though the advance is incremental because it builds on existing concerns about evaluation.

The authors address the evaluation of large language models on long-context tasks by introducing a metric, information coverage (IC), and a benchmark, ETHIC. Across 1,986 test instances in domains such as books and medicine, the benchmark reveals significant performance drops in contemporary models.

Recent advancements in large language models (LLMs) capable of processing extremely long texts highlight the need for a dedicated evaluation benchmark to assess their long-context capabilities. However, existing methods, like the needle-in-a-haystack test, do not effectively assess whether these models fully utilize contextual information, raising concerns about the reliability of current evaluation techniques. To thoroughly examine the effectiveness of existing benchmarks, we introduce a new metric called information coverage (IC), which quantifies the proportion of the input context necessary for answering queries. Our findings indicate that current benchmarks exhibit low IC; although the input context may be extensive, the actual usable context is often limited. To address this, we present ETHIC, a novel benchmark designed to assess LLMs' ability to leverage the entire context. Our benchmark comprises 1,986 test instances spanning four long-context tasks with high IC scores in the domains of books, debates, medicine, and law. Our evaluations reveal significant performance drops in contemporary LLMs, highlighting a critical challenge in managing long contexts. Our benchmark is available at https://github.com/dmis-lab/ETHIC.
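
To make the IC definition concrete, here is a minimal sketch of one way such a proportion could be computed. It is not the authors' implementation: it assumes the context has already been split into chunks of known length and that the chunks needed to answer the query have been annotated. The function name `information_coverage` and both parameters are hypothetical.

```python
from typing import Iterable, Sequence


def information_coverage(chunk_lengths: Sequence[int],
                         needed_indices: Iterable[int]) -> float:
    """Return the fraction of the context (by length) needed to answer a query.

    Assumption: chunk_lengths[i] is the length (e.g. in tokens) of chunk i,
    and needed_indices lists the chunks a correct answer depends on.
    """
    total = sum(chunk_lengths)
    if total == 0:
        raise ValueError("context is empty")
    # Deduplicate indices so a chunk listed twice is not counted twice.
    needed = sum(chunk_lengths[i] for i in set(needed_indices))
    return needed / total


# A needle-in-a-haystack style query: one relevant chunk out of ten.
haystack = [100] * 10
low_ic = information_coverage(haystack, [3])

# A query that requires every chunk, such as summarizing the whole input.
high_ic = information_coverage(haystack, range(len(haystack)))

print(f"low IC: {low_ic:.2f}, high IC: {high_ic:.2f}")
```

Under these assumptions the retrieval-style query yields a low score even though the context is long, while the whole-input query yields the maximum score, which is the contrast the abstract draws between existing benchmarks and ETHIC.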

Code Implementations: 1 repo
