CL · Jun 17, 2024

Can Many-Shot In-Context Learning Help LLMs as Evaluators? A Preliminary Empirical Study

arXiv:2406.11629v6 · 22 citations
Originality: Incremental advance
AI Analysis

This work addresses accuracy and reliability concerns for researchers and practitioners who use LLMs as evaluators, but it is an incremental improvement on existing in-context learning methods.

The study addresses bias in Large Language Models (LLMs) used as evaluators by proposing two many-shot in-context learning prompt templates. It finds that GPT-4o performs better in the many-shot regime than in the zero-shot and few-shot regimes, and that the Many-Shot with Reference (MSwR) template outperforms Many-Shot without Reference (MSoR).

Utilizing Large Language Models (LLMs) as evaluators to assess the performance of LLMs has garnered attention. However, this kind of evaluation approach is affected by potential biases within LLMs, raising concerns about the accuracy and reliability of the evaluation results of LLMs. To address this problem, we propose and study two many-shot In-Context Learning (ICL) prompt templates to help LLM evaluators mitigate potential biases: Many-Shot with Reference (MSwR) and Many-Shot without Reference (MSoR). Specifically, the former utilizes in-context examples with model-generated evaluation rationales as references, while the latter does not include these references. Using these prompt designs, we investigate the impact of increasing the number of in-context examples on the consistency and quality of the evaluation results. Experimental results show that advanced LLMs, such as GPT-4o, perform better in the many-shot regime than in the zero-shot and few-shot regimes. Furthermore, when using GPT-4o as an evaluator in the many-shot regime, adopting MSwR as the prompt template performs better than MSoR.
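
The difference between the two templates can be illustrated with a minimal sketch. This is not the paper's actual prompt: the instruction wording, the 1-10 scale, the `Example` fields, and the `build_prompt` helper are all hypothetical and chosen only to show that MSwR keeps the model-generated rationale in each in-context example while MSoR leaves it out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    """One in-context example. Field names are hypothetical."""
    question: str
    answer: str
    score: int
    rationale: Optional[str] = None  # model-generated evaluation rationale


def build_prompt(examples: List[Example], question: str, answer: str,
                 with_reference: bool) -> str:
    """Assemble a many-shot evaluation prompt.

    with_reference=True  gives an MSwR-style prompt (rationales included).
    with_reference=False gives an MSoR-style prompt (rationales omitted).
    """
    # Assumed instruction text and scoring scale, not taken from the paper.
    parts = ["Rate the answer to each question on a 1-10 scale."]
    for i, ex in enumerate(examples, start=1):
        block = [f"Example {i}",
                 f"Question: {ex.question}",
                 f"Answer: {ex.answer}"]
        if with_reference and ex.rationale:
            block.append(f"Rationale: {ex.rationale}")
        block.append(f"Score: {ex.score}")
        parts.append("\n".join(block))
    # The item to be judged comes last, with the score left open.
    parts.append("\n".join(["Now evaluate",
                            f"Question: {question}",
                            f"Answer: {answer}",
                            "Score:"]))
    return "\n\n".join(parts)
```

Growing the `examples` list moves the same template from the few-shot to the many-shot regime, which is the variable the study changes when it measures the consistency and quality of the evaluator.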
