CL | Mar 3, 2025

DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation

IBM
arXiv:2503.01622v3 | 18 citations | h-index: 12 | ACL
Originality: Incremental advance
AI Analysis

This work addresses the need for more robust and meaningful evaluation practices in the LLM community, though it is an incremental advance that builds on prior work on prompt sensitivity.

The authors tackle the sensitivity of LLM evaluation to arbitrary prompt variations by creating DOVE, a large-scale dataset of more than 250M prompt perturbations and model outputs spanning multiple dimensions. Their findings include efficient methods for selecting high-performing prompts and reduced sensitivity when few-shot examples are provided.

Recent work found that LLMs are sensitive to a wide range of arbitrary prompt dimensions, including the type of delimiters, answer enumerators, instruction wording, and more. This throws into question popular single-prompt evaluation practices. We present DOVE (Dataset Of Variation Evaluation), a large-scale dataset containing prompt perturbations of various evaluation benchmarks. In contrast to previous work, we examine LLM sensitivity from a holistic perspective and assess the joint effects of perturbations along various dimensions, resulting in thousands of perturbations per instance. We evaluate several model families against DOVE, leading to several findings, including efficient methods for choosing well-performing prompts, the observation that few-shot examples reduce sensitivity, and the identification of instances that are inherently hard across all perturbations. DOVE consists of more than 250M prompt perturbations and model outputs, which we make publicly available to spur a community-wide effort toward meaningful, robust, and efficient evaluation. Browse the data, contribute, and more: https://slab-nlp.github.io/DOVE/

