CL | AI | Jul 6, 2023

PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations

arXiv:2307.02762v3 | 134 citations | h-index: 20
Originality: Incremental advance
AI Analysis

This work addresses the challenge of reliably evaluating LLMs, which matters to researchers and developers. It is an incremental advance that builds on existing LLM-based evaluation approaches.

The paper tackles the problem of automatically evaluating and comparing large language models (LLMs) by addressing biases like self-enhancement and positional bias in existing LLM-based evaluation methods. It proposes peer rank and peer discussion algorithms, which achieve higher accuracy and better alignment with human judgments on benchmark datasets.

Nowadays, the quality of responses generated by different modern large language models (LLMs) is hard to evaluate and compare automatically. Recent studies suggest and predominantly use LLMs for reference-free evaluation of open-ended question answering. More specifically, they use the recognized "strongest" LLM as the evaluator, which conducts pairwise comparisons of candidate models' answers and provides a ranking score. However, this intuitive method has multiple problems, such as bringing in self-enhancement (favoring its own answers) and positional bias. We draw insights and lessons from the educational domain (Cho & MacArthur, 2011; Walsh, 2014) to improve LLM-based evaluations. Specifically, we propose (1) the peer rank (PR) algorithm that takes into account each peer LLM's pairwise preferences of all answer pairs, and outputs a final ranking of models; and (2) peer discussion (PD), where we prompt two LLMs to discuss and try to reach a mutual agreement on the preferences of two answers. We conduct experiments on two benchmark datasets. We find that our approaches achieve higher accuracy and align better with human judgments. Interestingly, PR can induce a relatively accurate self-ranking of models under the anonymous setting, where each model's name is unrevealed. Our work provides space to explore evaluating models that are hard to compare for humans.
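The peer rank idea described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes that each reviewer's pairwise preferences are combined into a weighted win rate, and that a reviewer's weight is set to its own normalized score and recomputed over a fixed number of iterations. The function name `peer_rank`, the `(reviewer, winner, loser)` tuple format, and the toy judgments are all hypothetical.

```python
def peer_rank(models, judgments, iterations=20):
    """Rank models from peer judgments given as (reviewer, winner, loser) tuples.

    Assumption (not taken from the source): a model's score is its
    reviewer-weighted win rate, and reviewer weights are the normalized scores.
    """
    # Start with every reviewer trusted equally.
    weights = {m: 1.0 / len(models) for m in models}
    scores = {m: 0.0 for m in models}
    for _ in range(iterations):
        won = {m: 0.0 for m in models}   # weighted wins per model
        seen = {m: 0.0 for m in models}  # weighted comparisons per model
        for reviewer, winner, loser in judgments:
            w = weights[reviewer]
            won[winner] += w
            seen[winner] += w
            seen[loser] += w
        scores = {m: (won[m] / seen[m] if seen[m] > 0 else 0.0) for m in models}
        total = sum(scores.values())
        if total == 0:
            break
        # Reviewers that score well get more say in the next round.
        weights = {m: scores[m] / total for m in models}
    ranking = sorted(models, key=lambda m: scores[m], reverse=True)
    return ranking, scores


# Toy data: A and B both judge A > B > C; C favors its own answer in every pair.
models = ["A", "B", "C"]
judgments = [
    ("A", "A", "B"), ("A", "A", "C"), ("A", "B", "C"),
    ("B", "A", "B"), ("B", "A", "C"), ("B", "B", "C"),
    ("C", "A", "B"), ("C", "C", "A"), ("C", "C", "B"),
]
ranking, scores = peer_rank(models, judgments)
print(ranking)
```

In this toy setup, the self-favoring reviewer C ties with B when all reviewers are weighted equally (a single iteration), but falls to last place once weights follow scores, which gives the ranking A, B, C.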

Code Implementations: 1 repo
