CL · Mar 21, 2025

CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization

arXiv:2503.17136v1 · 1 citation · h-index: 11
Originality: Highly original
AI Analysis

This work addresses the subjectivity of multi-annotator ratings in story evaluation, offering NLP researchers a novel method that improves evaluation accuracy.

The paper tackles the evaluation of creative text such as human-written stories, where self-consistency reasoning methods give suboptimal results. It proposes Chain-of-Keywords (CoKe), which generates keywords before rationales to guide rating predictions. The results show that CoKe-based models reach human-level performance, significantly outperform GPT-4 with a 2x boost in correlation with human annotators, and require drastically fewer parameters.

Evaluating creative text such as human-written stories using language models has always been a challenging task, owing to the subjectivity of multi-annotator ratings. To mimic the thinking process of humans, Chain-of-Thought (CoT) generates free-text explanations that help guide a model's predictions, and Self-Consistency (SC) marginalizes predictions over multiple generated explanations. In this study, we discover that the widely used self-consistency reasoning methods yield suboptimal results due to an objective mismatch between generating 'fluent-looking' explanations and actually producing a good rating prediction for an aspect of a story. To overcome this challenge, we propose Chain-of-Keywords (CoKe), which generates a sequence of keywords before generating a free-text rationale; these keywords guide the rating prediction of our evaluation language model. We then generate a diverse set of such keyword sequences and aggregate the scores corresponding to these generations. On the StoryER dataset, CoKe based on our small fine-tuned evaluation models not only reaches human-level performance and significantly outperforms GPT-4 with a 2x boost in correlation with human annotators, but also requires drastically fewer parameters.
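The sample-then-aggregate step described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `toy_sampler` stand-in, the keyword vocabulary, the 1-to-5 rating range, and the use of a plain mean as the aggregation rule are not taken from the paper, which may sample and aggregate differently.

```python
import random
from statistics import mean
from typing import Callable, List, Tuple

# Hypothetical interface: one stochastic generation yields (keywords, rating).
Sampler = Callable[[str, str, random.Random], Tuple[List[str], float]]


def toy_sampler(story: str, aspect: str, rng: random.Random) -> Tuple[List[str], float]:
    """Stand-in for a fine-tuned evaluation model (an assumption, not the paper's model)."""
    vocabulary = ["vivid", "predictable", "coherent", "rushed", "surprising"]
    positive = {"vivid", "coherent", "surprising"}
    # Step 1: produce a short keyword sequence first.
    keywords = rng.sample(vocabulary, k=3)
    # Step 2: derive a rating conditioned on the keywords.
    # Toy rule so the example runs: more positive keywords give a higher rating.
    rating = 1.0 + 4.0 * sum(k in positive for k in keywords) / len(keywords)
    return keywords, rating


def aggregate_rating(
    story: str,
    aspect: str,
    sampler: Sampler,
    num_samples: int = 8,
    seed: int = 0,
) -> float:
    """Sample several keyword sequences and average their ratings.

    The mean is an assumed aggregation rule chosen for simplicity.
    """
    rng = random.Random(seed)
    generations = [sampler(story, aspect, rng) for _ in range(num_samples)]
    return mean(rating for _, rating in generations)


if __name__ == "__main__":
    score = aggregate_rating("Once upon a time...", "creativity", toy_sampler)
    print(f"aggregated rating: {score:.2f}")
```

In a real setup, `toy_sampler` would be replaced by a call that samples from the evaluation model with non-zero temperature, so that each generation produces a different keyword sequence and its associated score.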

