Can Large Language Models Outperform Non-Experts in Poetry Evaluation? A Comparative Study Using the Consensual Assessment Technique
This work provides a novel methodology for assessing creative works such as poetry, potentially broadening LLM applications in creative domains, though the contribution is incremental in that it adapts an existing technique to a new context.
This study tackles the problem of evaluating poetry by adapting the Consensual Assessment Technique for Large Language Models, demonstrating that LLMs such as Claude-3-Opus significantly outperform non-expert human judges, with a Spearman's Rank Correlation against the ground truth of 0.87 versus 0.38 for the best non-expert evaluation.
This study adapts the Consensual Assessment Technique (CAT) for Large Language Models (LLMs), introducing a novel methodology for poetry evaluation. Using a 90-poem dataset with a ground truth based on publication venue, we demonstrate that this approach allows LLMs to significantly surpass the performance of non-expert human judges. Our method, which leverages forced-choice ranking within small, randomized batches, enabled Claude-3-Opus to achieve a Spearman's Rank Correlation (SRC) of 0.87 with the ground truth, dramatically outperforming the best human non-expert evaluation (SRC = 0.38). The LLM assessments also exhibited high inter-rater reliability, underscoring the methodology's robustness. These findings establish that LLMs, when guided by a comparative framework, can be effective and reliable tools for assessing poetry, paving the way for their broader application in other creative domains.
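To make the evaluation pipeline concrete, the following is a minimal Python sketch of forced-choice ranking within small, randomized batches, followed by a Spearman's Rank Correlation against a ground truth. It is an illustration under stated assumptions, not the study's actual implementation: the function names, the `rank_batch` callable (a stand-in for one LLM ranking call), the batch size of 15, the 10 rounds, and the score normalization are all hypothetical choices that the abstract does not specify.

```python
import random
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of tie-averaged ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def batched_forced_choice_scores(poem_ids, rank_batch, batch_size=15, rounds=10, seed=0):
    """Aggregate forced-choice rankings over randomized batches.

    rank_batch (hypothetical) takes a list of poem ids and returns the same
    ids ordered best-first, with no ties allowed. batch_size and rounds are
    illustrative defaults, not values taken from the study.
    """
    rng = random.Random(seed)
    scores = defaultdict(list)
    for _ in range(rounds):
        shuffled = list(poem_ids)
        rng.shuffle(shuffled)  # fresh random batch composition each round
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            ranked = rank_batch(batch)
            for pos, pid in enumerate(ranked):
                # Normalize position to [0, 1]; 1.0 means best in the batch.
                score = 1.0 - pos / (len(batch) - 1) if len(batch) > 1 else 0.5
                scores[pid].append(score)
    return {pid: mean(vals) for pid, vals in scores.items()}


# Toy usage: a stand-in judge that ranks by a hidden quality value.
poems = list(range(90))
hidden_quality = {pid: pid for pid in poems}  # assumed toy ground truth


def toy_judge(batch):
    return sorted(batch, key=lambda pid: hidden_quality[pid], reverse=True)


aggregated = batched_forced_choice_scores(poems, toy_judge)
src = spearman([aggregated[p] for p in poems], [hidden_quality[p] for p in poems])
```

In practice `toy_judge` would be replaced by a call that prompts the model to order the poems in one batch. Tie-averaged ranks matter here because a ground truth derived from publication venue places many poems in the same tier.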