CL · Oct 9, 2023

A Closer Look into Automatic Evaluation Using Large Language Models

arXiv:2310.05657v16.120 citations · h-index: 10 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses how to improve automatic evaluation of text quality, a problem relevant to NLP researchers and practitioners. The contribution is incremental because it builds on prior LLM-based evaluation approaches.

The paper analyzes how specific design choices in using large language models (LLMs) for automatic text evaluation affect alignment with human ratings, finding that asking LLMs to explain their ratings consistently improves correlation and achieves state-of-the-art results on two datasets.

Using large language models (LLMs) to evaluate text quality has recently gained popularity. Several prior works use LLMs for evaluation but differ in the details of the evaluation process. In this paper, we analyze LLM evaluation (Chiang and Lee, 2023) and G-Eval (Liu et al., 2023), and we discuss how those details change how well the ratings given by LLMs correlate with human ratings. We find that the auto Chain-of-Thought (CoT) used in G-Eval does not always make G-Eval more aligned with human ratings. We also show that forcing the LLM to output only a numeric rating, as in G-Eval, is suboptimal. Last, we reveal that asking the LLM to explain its own ratings consistently improves the correlation between ChatGPT and human ratings and pushes state-of-the-art (SoTA) correlations on two meta-evaluation datasets.
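The comparison described in the abstract can be illustrated with a minimal sketch: build two prompt variants (one that asks only for a number, one that asks for a rating followed by an explanation) and measure how well each variant's ratings correlate with human ratings. The prompt wording, the function names `build_prompt` and `pearson`, and all ratings below are assumptions made up for illustration; they are not the paper's prompts, code, or data, and the paper may report other correlation coefficients as well.

```python
from math import sqrt

# Hypothetical output instructions; the paper's actual prompt wording
# is not reproduced here.
OUTPUT_INSTRUCTIONS = {
    "score_only": "Answer with a single number from 1 to 5 and nothing else.",
    "rate_explain": "Give a rating from 1 to 5, then explain your rating.",
}


def build_prompt(task_description, sample, variant):
    """Assemble an evaluation prompt for one sample and one output variant."""
    instruction = OUTPUT_INSTRUCTIONS[variant]
    return f"{task_description}\n\nText:\n{sample}\n\n{instruction}"


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up ratings for five samples, for illustration only.
    human = [4.0, 2.0, 5.0, 3.0, 1.0]
    llm_by_variant = {
        "score_only": [3.0, 3.0, 4.0, 3.0, 2.0],
        "rate_explain": [4.0, 3.0, 5.0, 3.0, 2.0],
    }
    for variant, llm_ratings in llm_by_variant.items():
        print(variant, round(pearson(human, llm_ratings), 3))
```

In a real meta-evaluation the LLM ratings would come from model responses to the prompts, typically averaged over several sampled responses per item; that step is omitted here because it depends on the model API in use.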
