WebNovelBench: Placing LLM Novelists on the Web Novel Distribution
WebNovelBench provides a scalable and replicable benchmark for researchers and developers to assess and improve LLM narrative generation, though the contribution is incremental because it builds on existing evaluation methods.
The authors tackle the challenge of evaluating long-form storytelling in LLMs by introducing WebNovelBench, a benchmark built on over 4,000 Chinese web novels and a synopsis-to-story task. The benchmark distinguished human-written from LLM-generated content and was used to rank 24 state-of-the-art models.
Robustly evaluating the long-form storytelling capabilities of Large Language Models (LLMs) remains a significant challenge, as existing benchmarks often lack the necessary scale, diversity, or objective measures. To address this, we introduce WebNovelBench, a novel benchmark specifically designed for evaluating long-form novel generation. WebNovelBench leverages a large-scale dataset of over 4,000 Chinese web novels, framing evaluation as a synopsis-to-story generation task. We propose a multi-faceted framework encompassing eight narrative quality dimensions, assessed automatically via an LLM-as-Judge approach. Scores are aggregated using Principal Component Analysis and mapped to a percentile rank against human-authored works. Our experiments demonstrate that WebNovelBench effectively differentiates between human-written masterpieces, popular web novels, and LLM-generated content. We provide a comprehensive analysis of 24 state-of-the-art LLMs, ranking their storytelling abilities and offering insights for future development. This benchmark provides a scalable, replicable, and data-driven methodology for assessing and advancing LLM-driven narrative generation.
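The aggregation step described in the abstract (eight dimension scores combined with Principal Component Analysis, then mapped to a percentile rank against human-authored works) can be illustrated with a minimal sketch. This is not the authors' implementation: the use of only the first principal component, the sign convention, the function names, and the synthetic scores are all assumptions made for illustration.

```python
import numpy as np


def fit_first_component(human_scores):
    """Return (mean, weights) for the first principal component of an (n, d) score matrix."""
    mean = human_scores.mean(axis=0)
    centered = human_scores - mean
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    weights = vt[0]
    # Assumed sign convention: higher dimension scores give a higher aggregate.
    if weights.sum() < 0:
        weights = -weights
    return mean, weights


def aggregate(scores, mean, weights):
    """Project one or more d-dimensional score vectors onto the first component."""
    return (np.atleast_2d(scores) - mean) @ weights


def percentile_rank(value, reference):
    """Percentage of reference aggregates that are less than or equal to value."""
    return 100.0 * float(np.mean(reference <= value))


# Synthetic stand-in for judge scores of human-written novels (NOT real data):
# one shared quality factor plus independent noise on each of 8 dimensions.
rng = np.random.default_rng(0)
quality = rng.normal(size=(500, 1))
human = 3.0 + 0.8 * quality + 0.3 * rng.normal(size=(500, 8))

mean, weights = fit_first_component(human)
human_agg = aggregate(human, mean, weights)

# Hypothetical model output scored 3.5 on every dimension.
candidate = np.full(8, 3.5)
rank = percentile_rank(aggregate(candidate, mean, weights)[0], human_agg)
print(f"percentile rank against the human reference set: {rank:.1f}")
```

Fitting the component on human-written works only, as assumed here, keeps the reference distribution fixed, so a newly evaluated model can be ranked without changing the scores of earlier ones.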