CL · Oct 16, 2024

Auto-PRE: An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation

arXiv:2410.12265v2 · 1 citation · h-index: 52
Originality: Highly original
AI Analysis

The paper addresses the need for efficient and reliable LLM evaluation methods, offering a cost-effective approach that could benefit the broader AI community.

The paper tackles the problem of evaluating large language models (LLMs) by proposing Auto-PRE, an automatic peer-review framework that reduces costs and achieves state-of-the-art performance on tasks like summarization, non-factoid QA, and dialogue generation.

The rapid development of large language models (LLMs) has highlighted the need for efficient and reliable methods to evaluate their performance. Traditional evaluation methods often face challenges like high costs, limited task formats, dependence on human references, and systematic biases. To address these limitations, we propose Auto-PRE, an automatic LLM evaluation framework inspired by the peer review process. Unlike previous approaches that rely on human annotations, Auto-PRE automatically selects evaluator LLMs based on three core traits: consistency, pertinence, and self-confidence, which correspond to the instruction, content, and response stages, respectively, and collectively cover the entire evaluation process. Experiments on three representative tasks, including summarization, non-factoid QA, and dialogue generation, demonstrate that Auto-PRE achieves state-of-the-art performance while significantly reducing evaluation costs. Furthermore, the structured and scalable design of our automatic qualification exam framework provides valuable insights into automating the evaluation of LLMs-as-judges, paving the way for more advanced LLM-based evaluation frameworks.
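The selection step described in the abstract can be illustrated with a minimal sketch. It assumes each candidate evaluator has already been given one score in [0, 1] for each of the three traits; the `ExamResult` class, the `select_evaluators` function, the model names, the scores, and the single shared threshold are all hypothetical and are not taken from the paper, which may score and combine the traits differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExamResult:
    """Hypothetical per-candidate exam scores, each assumed to lie in [0, 1]."""
    model: str
    consistency: float      # instruction stage
    pertinence: float       # content stage
    self_confidence: float  # response stage


def select_evaluators(results, threshold=0.6):
    """Keep candidates whose three trait scores all reach the threshold.

    The shared threshold and the all-three-must-pass rule are illustrative
    assumptions, not the paper's actual selection criterion.
    """
    return [
        r.model
        for r in results
        if min(r.consistency, r.pertinence, r.self_confidence) >= threshold
    ]


# Made-up candidates and scores, for illustration only.
candidates = [
    ExamResult("model-a", 0.9, 0.8, 0.7),
    ExamResult("model-b", 0.9, 0.4, 0.8),  # fails on pertinence
    ExamResult("model-c", 0.7, 0.7, 0.6),
]
qualified = select_evaluators(candidates)
print(qualified)
```

In this sketch a candidate is rejected as soon as any one trait falls below the threshold, so a model that follows instructions consistently but judges on irrelevant content would still be excluded from the reviewer pool.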

