LG · CL · HC · Nov 21, 2024

A Framework for Evaluating LLMs Under Task Indeterminacy

arXiv:2411.13760v1 · 3 citations · h-index: 37
Originality: Incremental advance
AI Analysis

This work addresses a methodological issue in LLM evaluation that is relevant to the research community, but it is an incremental advance because it builds on existing evaluation frameworks.

The paper tackles the problem that LLM evaluations often assume a single correct response per item. That assumption breaks down when tasks are ambiguous or vague, and the paper's synthetic experiment shows that it leads evaluations to underestimate true performance.

Large language model (LLM) evaluations often assume there is a single correct response -- a gold label -- for each item in the evaluation corpus. However, some tasks can be ambiguous -- i.e., they provide insufficient information to identify a unique interpretation -- or vague -- i.e., they do not clearly indicate where to draw the line when making a determination. Both ambiguity and vagueness can cause task indeterminacy -- the condition where some items in the evaluation corpus have more than one correct response. In this paper, we develop a framework for evaluating LLMs under task indeterminacy. Our framework disentangles the relationships between task specification, human ratings, and LLM responses in the LLM evaluation pipeline. Using our framework, we conduct a synthetic experiment showing that evaluations that use the "gold label" assumption underestimate the true performance. We also provide a method for estimating an error-adjusted performance interval given partial knowledge about indeterminate items in the evaluation corpus. We conclude by outlining implications of our work for the research community.
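The gap described in the abstract can be illustrated with a small toy example. The sketch below is not the paper's framework or its estimator. It only assumes that each item has a set of correct responses, of which the gold label is one, and compares accuracy scored against the gold label with accuracy scored against the full set. The function names, the toy data, and the crude bound in `accuracy_bounds` (which treats any mismatch on an item flagged as possibly indeterminate as potentially correct) are all hypothetical choices made for illustration.

```python
from typing import Sequence, Set, Tuple


def gold_label_accuracy(responses: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of items whose response equals the single gold label."""
    return sum(r == g for r, g in zip(responses, gold)) / len(responses)


def set_based_accuracy(responses: Sequence[str], correct_sets: Sequence[Set[str]]) -> float:
    """Fraction of items whose response is in that item's set of correct responses."""
    return sum(r in s for r, s in zip(responses, correct_sets)) / len(responses)


def accuracy_bounds(
    responses: Sequence[str],
    gold: Sequence[str],
    maybe_indeterminate: Sequence[bool],
) -> Tuple[float, float]:
    """Crude lower/upper bounds when we only know which items MIGHT be indeterminate.

    Assumption (illustrative, not from the paper): a mismatch on a flagged item
    could still be a correct response, so it counts toward the upper bound only.
    """
    n = len(responses)
    matches = sum(r == g for r, g in zip(responses, gold))
    rescuable = sum(
        flag and r != g for r, g, flag in zip(responses, gold, maybe_indeterminate)
    )
    return matches / n, (matches + rescuable) / n


# Toy data: item 2 is indeterminate (both "A" and "B" are correct).
responses = ["A", "B", "A", "C", "B"]
gold = ["A", "A", "A", "C", "C"]
correct_sets = [{"A"}, {"A", "B"}, {"A"}, {"C"}, {"C"}]
# Partial knowledge: items 2 and 5 are flagged as possibly indeterminate.
maybe_indeterminate = [False, True, False, False, True]

print(gold_label_accuracy(responses, gold))         # 0.6
print(set_based_accuracy(responses, correct_sets))  # 0.8
print(accuracy_bounds(responses, gold, maybe_indeterminate))  # (0.6, 1.0)
```

In this toy setting the gold-label score (0.6) falls below the set-based score (0.8), and the set-based score lies inside the interval obtained from partial knowledge alone.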

