CL · Feb 26, 2025

Can Large Language Models Outperform Non-Experts in Poetry Evaluation? A Comparative Study Using the Consensual Assessment Technique

arXiv:2502.19064v2 · 2 citations · h-index: 6 · EMNLP
Originality: Incremental advance
AI Analysis

This work offers a novel methodology for assessing creative works such as poetry, which could broaden LLM applications in creative domains. The advance is incremental, however, because it adapts an existing technique to a new context.

This study addresses poetry evaluation by adapting the Consensual Assessment Technique for Large Language Models. Claude-3-Opus reached a Spearman's Rank Correlation of 0.87 with the ground truth, versus 0.38 for the best non-expert human evaluation.

This study adapts the Consensual Assessment Technique (CAT) for Large Language Models (LLMs), introducing a novel methodology for poetry evaluation. Using a 90-poem dataset with a ground truth based on publication venue, we demonstrate that this approach allows LLMs to significantly surpass the performance of non-expert human judges. Our method, which leverages forced-choice ranking within small, randomized batches, enabled Claude-3-Opus to achieve a Spearman's Rank Correlation (SRC) of 0.87 with the ground truth, dramatically outperforming the best human non-expert evaluation (SRC = 0.38). The LLM assessments also exhibited high inter-rater reliability, underscoring the methodology's robustness. These findings establish that LLMs, when guided by a comparative framework, can be effective and reliable tools for assessing poetry, paving the way for their broader application in other creative domains.
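The evaluation loop described in the abstract (forced-choice ranking inside small randomized batches, then a Spearman correlation against the ground truth) can be sketched roughly as follows. This is an illustrative sketch only, not the paper's implementation: the `judge` callable stands in for an LLM call, and the batch size, number of rounds, position-averaging scheme, and toy three-tier ground truth are all assumptions made for the example.

```python
import random
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def batched_scores(items, judge, batch_size, rounds, seed=0):
    """Average each item's normalised within-batch position over random rounds.

    `judge` is a hypothetical stand-in for the LLM: it receives a list of
    items and returns the same items ordered from worst to best.
    """
    rng = random.Random(seed)
    positions = defaultdict(list)
    for _ in range(rounds):
        shuffled = items[:]
        rng.shuffle(shuffled)  # new random batch composition every round
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            ranked = judge(batch)
            for pos, item in enumerate(ranked):
                positions[item].append(pos / max(len(ranked) - 1, 1))
    return {item: mean(positions[item]) for item in items}


# Toy data (assumed): nine poems in three quality tiers, and a judge that
# happens to know the tiers. A real run would call a model here instead.
quality = {f"poem_{i}": i // 3 for i in range(9)}
scores = batched_scores(
    list(quality),
    judge=lambda batch: sorted(batch, key=quality.get),
    batch_size=3,
    rounds=20,
)
rho = spearman([quality[p] for p in quality], [scores[p] for p in quality])
```

Forcing a ranking within each batch means the judge can never give every poem the same score, and re-shuffling the batches each round lets every poem be compared against many different neighbours before the averaged positions are correlated with the ground truth.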

