CL · Dec 31, 2024

LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts

Helia Hashemi, Jason Eisner, Corby Rosset, Benjamin Van Durme, Chris Kedzie

arXiv:2501.00274v136.5103 citations · h-index: 19 · Has Code · ACL

Originality: Incremental advance

AI Analysis

This work addresses reliable automated text evaluation, tested on dialogue systems. Its calibrated approach predicts human judgments more accurately than an uncalibrated LLM baseline, though the advance is incremental because it builds on existing LLM and rubric-based methods.

The paper tackles automated evaluation of natural language texts with a framework that prompts an LLM with each question of a manually constructed rubric and combines the resulting predictions through a small neural network to predict human annotations. For overall user satisfaction on a 1-4 scale, it achieves an RMS error below 0.5, a 2x improvement over the uncalibrated baseline.

This paper introduces a framework for the automated evaluation of natural language texts. A manually constructed rubric describes how to assess multiple dimensions of interest. To evaluate a text, a large language model (LLM) is prompted with each rubric question and produces a distribution over potential responses. The LLM predictions often fail to agree well with human judges -- indeed, the humans do not fully agree with one another. However, the multiple LLM distributions can be $\textit{combined}$ to $\textit{predict}$ each human judge's annotations on all questions, including a summary question that assesses overall quality or relevance. LLM-Rubric accomplishes this by training a small feed-forward neural network that includes both judge-specific and judge-independent parameters. When evaluating dialogue systems in a human-AI information-seeking task, we find that LLM-Rubric with 9 questions (assessing dimensions such as naturalness, conciseness, and citation quality) predicts human judges' assessment of overall user satisfaction, on a scale of 1--4, with RMS error $< 0.5$, a $2\times$ improvement over the uncalibrated baseline.
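
The calibration step described in the abstract can be illustrated with a minimal, untrained sketch in plain Python. It is not the authors' implementation: the hidden width, the additive way judge-specific and judge-independent weights are combined, the random initialization, and the use of an expected value as the point prediction are all assumptions made for illustration. Only the 9 questions and the 1-4 answer scale come from the abstract.

```python
import math
import random

NUM_QUESTIONS = 9  # rubric questions, as in the abstract
NUM_CHOICES = 4    # answers on a 1-4 scale, as in the abstract
HIDDEN = 8         # hypothetical hidden width (assumption)


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class CalibrationNet:
    """Toy feed-forward net with shared and per-judge weights (assumed layout)."""

    def __init__(self, num_judges, seed=0):
        rng = random.Random(seed)
        in_dim = NUM_QUESTIONS * NUM_CHOICES
        self.w_shared = random_matrix(HIDDEN, in_dim, rng)  # judge-independent
        self.w_judge = [random_matrix(HIDDEN, in_dim, rng)  # judge-specific
                        for _ in range(num_judges)]
        self.w_out = random_matrix(NUM_CHOICES, HIDDEN, rng)

    def predict(self, llm_dists, judge):
        """Map the LLM's per-question distributions to one judge's
        predicted distribution over answers to the summary question."""
        x = [p for dist in llm_dists for p in dist]  # flatten 9 x 4 inputs
        shared = matvec(self.w_shared, x)
        specific = matvec(self.w_judge[judge], x)
        hidden = [math.tanh(a + b) for a, b in zip(shared, specific)]
        return softmax(matvec(self.w_out, hidden))


def expected_score(dist):
    """Point prediction: expected answer on the 1-4 scale (assumption)."""
    return sum((k + 1) * p for k, p in enumerate(dist))


def rms_error(predictions, targets):
    """Root-mean-square error between predicted and human scores."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up LLM output: the same distribution for every rubric question.
llm_dists = [[0.1, 0.2, 0.3, 0.4] for _ in range(NUM_QUESTIONS)]
net = CalibrationNet(num_judges=3)
dist = net.predict(llm_dists, judge=1)
score = expected_score(dist)
```

Because the weights here are random, `score` carries no meaning. In a trained version, the parameters would be fit to human annotations, and `rms_error` would compare predicted scores against each judge's actual ratings.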
