CL · AI · Feb 25, 2024

Likelihood-based Mitigation of Evaluation Bias in Large Language Models

arXiv:2402.15987v4 · 31 citations · h-index: 31 · ACL
Originality: Incremental advance
AI Analysis

The paper addresses an important issue for researchers and practitioners who use LLMs as automated metrics for natural language generation tasks. The advance is incremental, since it builds on existing in-context learning techniques.

The paper tackles likelihood bias in LLM-based evaluators: a model may overrate sentences to which it assigns higher likelihood, even when they differ from lower-likelihood sentences only superficially. The proposed mitigation uses highly biased instances as few-shot examples for in-context learning, which the authors report significantly improves evaluation performance, measured as correlation with human scores.

Large Language Models (LLMs) are widely used as automated metrics for evaluating natural language generation tasks. However, the likelihood, a measure of how plausible an LLM finds a sentence, can vary with superficial differences between sentences, such as word order and sentence structure. LLMs used for evaluation may therefore exhibit a likelihood bias: they might overrate sentences with higher likelihoods while underrating those with lower likelihoods. In this paper, we investigate the presence and impact of likelihood bias in LLM-based evaluators. We also propose a method to mitigate this bias, which uses highly biased instances as few-shot examples for in-context learning. Our experiments on evaluating the data-to-text and grammatical error correction tasks reveal that several of the LLMs we test display a likelihood bias. Furthermore, our proposed method successfully mitigates this bias and significantly improves evaluation performance, measured as the correlation between model scores and human scores.
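
The mitigation idea in the abstract (pick highly biased instances, then reuse them as few-shot examples) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's formulation: the `Instance` fields, the per-instance bias strength (rescaled likelihood multiplied by the rescaled gap between LLM and human scores), and the prompt wording are all hypothetical choices made for this example. Log-likelihoods and scores are assumed to be precomputed.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    text: str
    log_likelihood: float  # assumed precomputed with the evaluator LLM
    llm_score: float       # score the LLM evaluator assigned
    human_score: float     # gold score from human annotators


def rescale(values):
    """Min-max rescale to [-1, 1]; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]


def select_biased(instances, k):
    """Return the k instances whose score gap most follows their likelihood.

    Hypothetical criterion: the product is large when a high-likelihood text
    is overrated or a low-likelihood text is underrated.
    """
    likelihood = rescale([i.log_likelihood for i in instances])
    gap = rescale([i.llm_score - i.human_score for i in instances])
    strength = [l * g for l, g in zip(likelihood, gap)]
    order = sorted(range(len(instances)), key=lambda j: strength[j], reverse=True)
    return [instances[j] for j in order[:k]]


def build_prompt(examples, candidate):
    """Few-shot prompt that pairs biased texts with their human scores."""
    parts = ["Rate each text from 1 to 5."]
    for ex in examples:
        parts.append(f"Text: {ex.text}\nScore: {ex.human_score:g}")
    parts.append(f"Text: {candidate}\nScore:")
    return "\n\n".join(parts)


pool = [
    Instance("A", -5.0, 5.0, 2.0),   # high likelihood, overrated
    Instance("B", -50.0, 1.0, 4.0),  # low likelihood, underrated
    Instance("C", -20.0, 3.0, 3.0),  # LLM agrees with humans
    Instance("D", -30.0, 3.0, 3.0),  # LLM agrees with humans
]
few_shot = select_biased(pool, k=2)
prompt = build_prompt(few_shot, "Text to evaluate")
```

Showing the human score next to each biased example is the design choice that matters here: the evaluator sees cases where its likelihood-driven judgment would have disagreed with annotators. How the paper actually ranks and presents the instances should be checked against the paper itself.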

Code Implementations: 1 repo
