CL AI May 6

StoryAlign: Evaluating and Training Reward Models for Story Generation

Haotian Xia, Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li

arXiv:2605.04831 | 91.01 citations | h-index: 8 | Has Code

AI Analysis

For researchers working on story generation and alignment of LLMs with human preferences, this work provides a benchmark and a specialized reward model to address the underexplored problem of modeling subjective story preferences.

The paper introduces StoryRMB, the first benchmark for evaluating reward models on story generation preferences, and finds that even the best existing reward model reaches only 66.3% accuracy. The authors then develop StoryReward, a reward model trained on roughly 100,000 story preference pairs, which achieves state-of-the-art performance on StoryRMB and generally selects stories better aligned with human preferences in best-of-n decoding.

Story generation aims to automatically produce coherent, structured, and engaging narratives. Although large language models (LLMs) have significantly advanced text generation, stories generated by LLMs still diverge from human-authored works regarding complex narrative structure and human-aligned preferences. A key reason is the absence of effective modeling of human story preferences, which are inherently subjective and under-explored. In this work, we systematically evaluate the modeling of human story preferences and introduce StoryRMB, the first benchmark for assessing reward models on story preferences. StoryRMB contains $1,133$ high-quality, human-verified instances, each consisting of a prompt, one chosen story, and three rejected stories. We find existing reward models struggle to select human-preferred stories, with the best model achieving only $66.3\%$ accuracy. To address this limitation, we construct roughly $100,000$ high-quality story preference pairs across diverse domains and develop StoryReward, an advanced reward model for story preference trained on this dataset. StoryReward achieves state-of-the-art (SoTA) performance on StoryRMB, outperforming much larger models. We also adopt StoryReward in downstream test-time scaling applications for best-of-n (BoN) story selection and find that it generally chooses stories better aligned with human preferences. We will release our dataset, model, and code to facilitate future research. Related code and data are available at https://github.com/THU-KEG/StoryReward.
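The two uses of a reward model mentioned in the abstract, scoring benchmark instances and best-of-n story selection, can be illustrated with a minimal sketch. Everything named here is hypothetical: `RewardFn`, `toy_reward`, and the instance keys `prompt`, `chosen`, and `rejected` are illustrative and not taken from the StoryReward code base. The accuracy criterion (the chosen story must outscore all three rejected stories) is an assumption, since the abstract does not spell out the metric.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical scorer signature: (prompt, story) -> scalar reward.
RewardFn = Callable[[str, str], float]


def best_of_n(prompt: str, candidates: Sequence[str], reward: RewardFn) -> str:
    """Return the candidate story with the highest reward for the prompt."""
    if not candidates:
        raise ValueError("need at least one candidate story")
    return max(candidates, key=lambda story: reward(prompt, story))


def benchmark_accuracy(instances: List[Dict], reward: RewardFn) -> float:
    """Fraction of instances where the chosen story outscores every rejected one.

    Each instance is a dict with keys "prompt", "chosen", and "rejected" (a list).
    Assumption: an instance counts as correct only if the chosen story beats
    all rejected stories; the paper may define accuracy differently.
    """
    if not instances:
        raise ValueError("need at least one instance")
    correct = 0
    for inst in instances:
        chosen_score = reward(inst["prompt"], inst["chosen"])
        if all(chosen_score > reward(inst["prompt"], r) for r in inst["rejected"]):
            correct += 1
    return correct / len(instances)


def toy_reward(prompt: str, story: str) -> float:
    """Stand-in scorer that counts distinct words. Not a real reward model."""
    return float(len(set(story.lower().split())))
```

In practice `toy_reward` would be replaced by a call to a trained reward model; the selection and accuracy logic stays the same because both functions only depend on the `(prompt, story) -> float` interface.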
