CL · AI · GT · May 23, 2024

Eliciting Informative Text Evaluations with Large Language Models

arXiv:2405.15077v4 · 24 citations · h-index: 6 · EC
Originality: Incremental advance
AI Analysis

The paper addresses the need for reliable feedback in domains such as peer review and e-commerce. It is an incremental advance: it builds on existing peer prediction methods by incorporating LLMs.

The paper tackles the problem of motivating high-quality textual feedback by extending peer prediction mechanisms to text-based reports using large language models. It shows that the proposed mechanisms can incentivize truthful reporting and differentiate review quality levels on the ICLR dataset, and that GSPPM penalizes LLM-generated reviews more effectively than GPPM.

Peer prediction mechanisms motivate high-quality feedback with provable guarantees. However, current methods only apply to rather simple reports, like multiple-choice or scalar numbers. We aim to broaden these techniques to the larger domain of text-based reports, drawing on the recent developments in large language models. This vastly increases the applicability of peer prediction mechanisms as textual feedback is the norm in a large variety of feedback channels: peer reviews, e-commerce customer reviews, and comments on social media. We introduce two mechanisms, the Generative Peer Prediction Mechanism (GPPM) and the Generative Synopsis Peer Prediction Mechanism (GSPPM). These mechanisms utilize LLMs as predictors, mapping from one agent's report to a prediction of her peer's report. Theoretically, we show that when the LLM prediction is sufficiently accurate, our mechanisms can incentivize high effort and truth-telling as an (approximate) Bayesian Nash equilibrium. Empirically, we confirm the efficacy of our mechanisms through experiments conducted on two real datasets: the Yelp review dataset and the ICLR OpenReview dataset. We highlight the results that on the ICLR dataset, our mechanisms can differentiate three quality levels -- human-written reviews, GPT-4-generated reviews, and GPT-3.5-generated reviews in terms of expected scores. Additionally, GSPPM penalizes LLM-generated reviews more effectively than GPPM.
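To make the "LLM as predictor" idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a log scoring rule (the abstract does not state the exact rule): an agent's score is the log-probability that a predictor assigns to her peer's report given her own report. The predictor here, `toy_log_prob`, is a hypothetical smoothed unigram stand-in for an LLM, and the names `gppm_score`, `gsppm_score`, and `vocab_size` are likewise illustrative. The synopsis variant simply adds a synopsis of the item to the conditioning context, which is one plausible reading of GSPPM rather than a confirmed definition.

```python
import math
from collections import Counter
from typing import Callable

LogProb = Callable[[str, str], float]


def toy_log_prob(target: str, context: str, vocab_size: int = 1000) -> float:
    """Hypothetical stand-in for an LLM predictor.

    Returns the log-probability of `target` under an add-one-smoothed
    unigram model fitted to the words of `context`.
    """
    counts = Counter(context.lower().split())
    total = sum(counts.values())
    return sum(
        math.log((counts[token] + 1) / (total + vocab_size))
        for token in target.lower().split()
    )


def gppm_score(report: str, peer_report: str,
               log_prob: LogProb = toy_log_prob) -> float:
    """Assumed GPPM-style score: log Pr(peer report | agent's report)."""
    return log_prob(peer_report, report)


def gsppm_score(report: str, peer_report: str, synopsis: str,
                log_prob: LogProb = toy_log_prob) -> float:
    """Assumed GSPPM-style score: the predictor also sees a synopsis
    of the item, so the context is the synopsis plus the agent's report."""
    return log_prob(peer_report, synopsis + " " + report)


if __name__ == "__main__":
    peer = "the method is novel but experiments are weak"
    print(gppm_score("novel method but weak experiments", peer))
    print(gppm_score("good paper", peer))
```

In this toy setting, a report that shares content with the peer's report makes the peer's report easier to predict and therefore scores higher than a generic one. With a real LLM, `log_prob` would be replaced by a function that sums the model's token log-probabilities for the peer's report given a prompt containing the agent's report.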

Code Implementations: 1 repo