ElicitationGPT: Text Elicitation Mechanisms via Language Models
This work addresses the need for automated scoring mechanisms with provable guarantees in information elicitation, which may be useful for AI applications. The contribution is incremental: it builds on existing scoring rule theory and applies it to text.
The paper tackles the problem of scoring elicited text against ground-truth text by reducing it to forecast elicitation via queries to a large language model (ChatGPT). On peer reviews from a peer-grading dataset, it empirically evaluates how well the resulting scores align with human preferences, using manual instructor scores for the peer reviews as the comparison.
Scoring rules evaluate probabilistic forecasts of an unknown state against the realized state and are a fundamental building block in the incentivized elicitation of information. This paper develops mechanisms for scoring elicited text against ground truth text by reducing the textual information elicitation problem to a forecast elicitation problem, via domain-knowledge-free queries to a large language model (specifically ChatGPT), and empirically evaluates their alignment with human preferences. Our theoretical analysis shows that the reduction achieves provable properness via black-box language models. The empirical evaluation is conducted on peer reviews from a peer-grading dataset, in comparison to manual instructor scores for the peer reviews. Our results suggest a paradigm of algorithmic artificial intelligence that may be useful for developing artificial intelligence technologies with provable guarantees.
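To make the notion of a proper scoring rule concrete, the following is a minimal sketch of the standard quadratic (Brier-style) score for a binary state. It is a textbook illustration only, not the paper's text-scoring mechanism, and the function names `quadratic_score`, `expected_score`, and `best_report` are hypothetical names chosen for this example. The sketch checks numerically that a forecaster who believes the state is 1 with a given probability maximizes expected score by reporting that belief, which is what properness means.

```python
def quadratic_score(forecast: float, outcome: int) -> float:
    """Quadratic score for a binary state; higher is better.

    `forecast` is the reported probability that the state is 1,
    `outcome` is the realized state (0 or 1).
    """
    return 1.0 - (outcome - forecast) ** 2


def expected_score(report: float, belief: float) -> float:
    """Expected score of reporting `report` when the state is 1 with probability `belief`."""
    return (belief * quadratic_score(report, 1)
            + (1.0 - belief) * quadratic_score(report, 0))


def best_report(belief: float, grid_size: int = 101) -> float:
    """Search a uniform grid on [0, 1] for the report with the highest expected score."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda report: expected_score(report, belief))


if __name__ == "__main__":
    belief = 0.7
    # Under a proper scoring rule, the best report coincides with the belief.
    print("best report on the grid:", best_report(belief))
    print("expected score, truthful report:", expected_score(belief, belief))
    print("expected score, exaggerated report:", expected_score(0.9, belief))
```

The grid search is used here only to keep the check self-contained; for the quadratic score the same conclusion follows from setting the derivative of the expected score with respect to the report to zero.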