CL AI | Nov 4, 2024

Evaluating Creative Short Story Generation in Humans and Large Language Models

Mete Ismayilzada, Claire Stevenson, Lonneke van der Plas

arXiv:2411.02316v511.924 citations | h-index: 8 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of evaluating the creative capabilities of AI, a question relevant to researchers and developers. Its contribution is incremental, since it builds on existing story generation tasks.

The study systematically compared creativity in short story generation between 60 large language models (LLMs) and 60 humans, finding that LLMs produced stylistically complex stories but scored lower in novelty, surprise, and diversity than average human writers, with expert ratings aligning with automated metrics.

Story-writing is a fundamental aspect of human imagination, relying heavily on creativity to produce narratives that are novel, effective, and surprising. While large language models (LLMs) have demonstrated the ability to generate high-quality stories, their creative story-writing capabilities remain under-explored. In this work, we conduct a systematic analysis of creativity in short story generation across 60 LLMs and 60 people using a five-sentence cue-word-based creative story-writing task. We use measures to automatically evaluate model- and human-generated stories across several dimensions of creativity, including novelty, surprise, diversity, and linguistic complexity. We also collect creativity ratings and Turing Test classifications from non-expert and expert human raters and LLMs. Automated metrics show that LLMs generate stylistically complex stories, but tend to fall short in terms of novelty, surprise and diversity when compared to average human writers. Expert ratings generally coincide with automated metrics. However, LLMs and non-experts rate LLM stories to be more creative than human-generated stories. We discuss why and how these differences in ratings occur, and their implications for both human and artificial creativity.
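The abstract names diversity as one of the automatically evaluated dimensions but does not say how it is computed. As a rough illustration only, and not the authors' method, the sketch below shows two common lexical proxies for diversity across a set of stories: the distinct-n ratio and the mean pairwise Jaccard distance between story vocabularies. The function names, the whitespace tokenization, and the example stories are all assumptions made for this sketch.

```python
from itertools import combinations


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(stories, n=2):
    """Fraction of unique n-grams across all stories (higher = more diverse).

    Assumption: naive lowercase whitespace tokenization, for illustration.
    """
    all_ngrams = []
    for story in stories:
        all_ngrams.extend(ngrams(story.lower().split(), n))
    if not all_ngrams:
        return 0.0
    return len(set(all_ngrams)) / len(all_ngrams)


def mean_pairwise_jaccard_distance(stories):
    """Average of 1 - |A & B| / |A | B| over all pairs of story vocabularies."""
    vocabularies = [set(story.lower().split()) for story in stories]
    distances = []
    for a, b in combinations(vocabularies, 2):
        union = a | b
        if union:
            distances.append(1.0 - len(a & b) / len(union))
    return sum(distances) / len(distances) if distances else 0.0


if __name__ == "__main__":
    # Hypothetical stories, invented purely for this example.
    sample_stories = [
        "the stamp fell from the letter and nobody noticed",
        "a letter arrived with a strange stamp from nowhere",
        "she sent the letter without a stamp and it still arrived",
    ]
    print("distinct-2:", distinct_n(sample_stories, n=2))
    print("mean Jaccard distance:", mean_pairwise_jaccard_distance(sample_stories))
```

Both scores lie between 0 and 1, with higher values meaning less overlap between stories. They capture only surface-level word overlap, so they would be at best a crude stand-in for the creativity dimensions the paper evaluates.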
