CL · AI · Oct 22, 2023

Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation

NVIDIA
arXiv:2310.14424v1 · 22 citations · h-index: 57
Originality: Incremental advance
AI Analysis

This work addresses the challenge of high costs and time in human evaluations for large language models, offering an incremental improvement in efficiency for researchers and practitioners.

The paper aims to reduce the cost of human evaluation for large language models by prioritizing the data instances that best distinguish between models. When annotation focuses on the top-20 percentile of prioritized instances, indecisive outcomes drop by up to 54% compared to random sampling.

Human evaluation is increasingly critical for assessing large language models, capturing linguistic nuances, and reflecting user preferences more accurately than traditional automated metrics. However, the resource-intensive nature of this type of annotation process poses significant challenges. The key question driving our work: "is it feasible to minimize human-in-the-loop feedback by prioritizing data instances which most effectively distinguish between models?" We evaluate several metric-based methods and find that these metrics enhance the efficiency of human evaluations by minimizing the number of required annotations, thus saving time and cost, while ensuring a robust performance evaluation. We show that our method is effective across widely used model families, reducing instances of indecisive (or "tie") outcomes by up to 54% compared to a random sample when focusing on the top-20 percentile of prioritized instances. This potential reduction in required human effort positions our approach as a valuable strategy in future large language model evaluations.
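The prioritization idea in the abstract can be illustrated with a minimal sketch. It assumes two models have each produced one completion per prompt, scores every prompt by how much the two completions differ, and keeps only the top-scoring fraction for human annotation. The token-level Jaccard distance used here is a stand-in chosen for simplicity, not the metrics the paper evaluates, and the names `jaccard_distance` and `prioritize` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def jaccard_distance(a: str, b: str) -> float:
    """Stand-in dissimilarity score (an assumption, not the paper's metric).

    Returns 1 minus the overlap ratio of the two lowercase token sets:
    0.0 for identical token sets, 1.0 for disjoint ones.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # two empty completions are treated as identical
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def prioritize(
    prompts: Sequence[str],
    outputs_a: Sequence[str],
    outputs_b: Sequence[str],
    top_fraction: float = 0.2,
    score: Callable[[str, str], float] = jaccard_distance,
) -> List[Tuple[str, float]]:
    """Rank prompts by how strongly the two models disagree and keep the top fraction."""
    if not (len(prompts) == len(outputs_a) == len(outputs_b)):
        raise ValueError("prompts and both output lists must have the same length")
    if not 0.0 < top_fraction <= 1.0:
        raise ValueError("top_fraction must be in (0, 1]")

    scored = [(p, score(a, b)) for p, a, b in zip(prompts, outputs_a, outputs_b)]
    # Highest dissimilarity first: these are the prompts least likely to end in a tie.
    scored.sort(key=lambda pair: pair[1], reverse=True)

    keep = math.ceil(len(scored) * top_fraction)
    return scored[:keep]


if __name__ == "__main__":
    prompts = ["p1", "p2", "p3", "p4", "p5"]
    model_a = ["the cat sat", "yes", "blue sky", "a b c", "hello world"]
    model_b = ["the cat sat", "no", "blue sea", "a b d", "hello world"]
    for prompt, s in prioritize(prompts, model_a, model_b, top_fraction=0.4):
        print(f"{prompt}: {s:.2f}")
```

Prompts where both models give near-identical completions score close to zero and fall to the bottom of the ranking, which matches the intuition that such prompts are the likeliest to produce indecisive human judgments. Any other pairwise score can be passed through the `score` parameter.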
