CL · AI · LG · Feb 21, 2024

Ranking Large Language Models without Ground Truth

arXiv:2402.14860v4 · 29 citations · h-index: 33 · ACL
Originality: Incremental advance
AI Analysis

The paper offers a low-resource way to rank LLMs that avoids both the cost of human evaluation and the unreliability of pairwise LLM evaluation. The advance is incremental: it addresses evaluation challenges that are already well known.

The paper ranks large language models without ground truth by forming triplets of models in which each model evaluates the other two. In experiments on summarization, multiple-choice, and dialog tasks, this approach reliably recovers rankings close to the true ones.

Evaluation and ranking of large language models (LLMs) has become an important problem with the proliferation of these models and their impact. Evaluation methods either require human responses, which are expensive to acquire, or use pairs of LLMs to evaluate each other, which can be unreliable. In this paper, we provide a novel perspective where, given a dataset of prompts (viz. questions, instructions, etc.) and a set of LLMs, we rank them without access to any ground truth or reference responses. Inspired by real life, where both an expert and a knowledgeable person can identify a novice, our main idea is to consider triplets of models, where each one of them evaluates the other two, correctly identifying the worst model in the triplet with high probability. We also analyze our idea and provide sufficient conditions for it to succeed. Applying this idea repeatedly, we propose two methods to rank LLMs. In experiments on different generative tasks (summarization, multiple-choice, and dialog), our methods reliably recover close-to-true rankings without reference data. This points to a viable low-resource mechanism for practical use.
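
The triplet idea in the abstract can be illustrated with a minimal Python sketch. It is not either of the paper's two ranking methods, which the abstract does not spell out. The `judge` callback, the `TOY_QUALITY` scores, the model names, and the aggregation rule (sort by how often a model is flagged as worst, then by total "worse" votes received) are all assumptions made for illustration. In practice, `judge` would wrap a call in which one LLM compares two other models' responses to a prompt.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence

# Hypothetical judge signature: judge(evaluator, a, b, prompt) returns whichever
# of the two models a or b the evaluator thinks answered the prompt better.
Judge = Callable[[str, str, str, str], str]


def worse_votes(triplet: Sequence[str], prompts: Sequence[str], judge: Judge) -> Dict[str, int]:
    """Each model in the triplet judges the other two on every prompt.

    Returns, per model, how many times it was judged the worse of a pair.
    """
    votes = {m: 0 for m in triplet}
    for evaluator in triplet:
        a, b = [m for m in triplet if m != evaluator]
        for prompt in prompts:
            better = judge(evaluator, a, b, prompt)
            worse = b if better == a else a
            votes[worse] += 1
    return votes


def rank_models(models: Sequence[str], prompts: Sequence[str], judge: Judge) -> List[str]:
    """Rank models best-first without reference answers (illustrative aggregation).

    For every triplet, the model with the most "worse" votes is flagged as the
    worst (ties go to the first such model in the triplet). Models are sorted by
    how often they were flagged, with total "worse" votes as a tie-breaker.
    """
    times_worst = {m: 0 for m in models}
    total_votes = {m: 0 for m in models}
    for triplet in combinations(models, 3):
        votes = worse_votes(triplet, prompts, judge)
        worst = max(triplet, key=votes.__getitem__)
        times_worst[worst] += 1
        for model, count in votes.items():
            total_votes[model] += count
    return sorted(models, key=lambda m: (times_worst[m], total_votes[m]))


# Toy stand-in for real LLM judgments: hidden quality scores (assumed values)
# and a noise-free judge that always prefers the higher-quality model.
TOY_QUALITY = {"model_a": 0.9, "model_b": 0.7, "model_c": 0.5, "model_d": 0.2}


def toy_judge(evaluator: str, a: str, b: str, prompt: str) -> str:
    return a if TOY_QUALITY[a] >= TOY_QUALITY[b] else b


if __name__ == "__main__":
    prompts = ["prompt 1", "prompt 2"]
    print(rank_models(list(TOY_QUALITY), prompts, toy_judge))
```

The toy judge is deliberately noise-free so the behaviour is easy to trace. The abstract's claim is probabilistic: the worst model is identified "with high probability" under sufficient conditions that the paper analyzes, so a realistic judge would make mistakes and the recovered ranking would only approximate the true one.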

