Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition
This work addresses the challenge of costly and inaccurate LLM evaluation for researchers and developers, though its contribution is incremental: it builds on existing human evaluation and ranking techniques.
The authors tackle the expensive and unreliable evaluation of large language models (LLMs) by proposing a sample-efficient human evaluation method based on maximum discrepancy competition. The method recovers gold-standard model rankings with a handful of selected instructions and reveals each model's strengths and weaknesses across tasks such as scientific knowledge and code generation.
Reliable evaluation of large language models (LLMs) is impeded by two key challenges: objective metrics often fail to reflect human perception of natural language, and exhaustive human labeling is prohibitively expensive. Here, we propose a sample-efficient human evaluation method for LLMs based on the principle of MAximum Discrepancy (MAD) Competition. Our method automatically and adaptively selects a compact set of input instructions that maximize semantic discrepancy between pairs of LLM responses. Human evaluators then perform three-alternative forced choices on these paired responses, which are aggregated into a global ranking using Elo rating. We apply our approach to compare eight widely used LLMs across four tasks: scientific knowledge understanding, mathematical reasoning, creative and functional writing, and code generation and explanation. Experimental results show that our sample-efficient evaluation method recovers "gold-standard" model rankings with a handful of MAD-selected instructions, reveals respective strengths and weaknesses of each LLM, and offers nuanced insights to guide future LLM development. Code is available at https://github.com/weiji-Feng/MAD-Eval .
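The pipeline described in the abstract (select the instructions on which two models disagree most, collect three-alternative forced choices, aggregate with Elo) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the discrepancy measure or the Elo parameters, so the cosine distance over response embeddings, the K-factor of 32, and the initial rating of 1000 are all assumptions, and the function names `select_mad_instructions`, `elo_update`, and `rank_models` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two vectors (assumed stand-in for semantic discrepancy)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty embedding as uninformative
    return 1.0 - dot / (norm_u * norm_v)


def select_mad_instructions(embeddings_a, embeddings_b, k):
    """Return indices of the k instructions whose paired responses differ most.

    embeddings_a[i] and embeddings_b[i] are embeddings of the two models'
    responses to instruction i (the embedding model is an assumption).
    """
    scores = [cosine_distance(a, b) for a, b in zip(embeddings_a, embeddings_b)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def elo_update(ratings, model_a, model_b, outcome, k_factor=32.0):
    """Update ratings in place from one three-alternative forced choice.

    outcome: 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((ratings[model_b] - ratings[model_a]) / 400.0))
    delta = k_factor * (outcome - expected_a)
    ratings[model_a] += delta
    ratings[model_b] -= delta


def rank_models(judgments, initial=1000.0):
    """Aggregate (model_a, model_b, outcome) judgments into a global ranking."""
    ratings = {}
    for model_a, model_b, outcome in judgments:
        ratings.setdefault(model_a, initial)
        ratings.setdefault(model_b, initial)
        elo_update(ratings, model_a, model_b, outcome)
    order = sorted(ratings, key=ratings.get, reverse=True)
    return order, ratings
```

In this sketch, each human judgment is encoded as a score for the first model, so the tie option in the forced choice maps to 0.5. Sequential Elo updates depend on the order of the judgments, so a practical run might shuffle and average over several passes; whether the authors do so is not stated in the abstract.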