CL AI

Sep 17, 2024

Small Language Models can Outperform Humans in Short Creative Writing: A Study Comparing SLMs with Humans and LLMs

Guillermo Marco, Luz Rello, Julio Gonzalo

arXiv:2409.11547v2
15.934 citations
h-index: 6
Has Code

Originality: Incremental advance

AI Analysis

This research addresses the challenge of balancing creativity, fluency, and coherence in AI-generated creative writing for applications in content creation and human-AI collaboration, though it is incremental as it builds on existing model comparisons.

The study evaluates creative fiction writing by comparing a fine-tuned small language model (BART-large) with human writers and two large language models (GPT-3.5 and GPT-4o). BART-large outscored average human writers overall (2.11 vs. 1.85, a 14% relative improvement), and a larger share of its synopses featured surprising associations (15% vs. 3% for GPT-4o).

In this paper, we evaluate the creative fiction writing abilities of a fine-tuned small language model (SLM), BART-large, and compare its performance to human writers and two large language models (LLMs): GPT-3.5 and GPT-4o. Our evaluation consists of two experiments: (i) a human study in which 68 participants rated short stories from humans and the SLM on grammaticality, relevance, creativity, and attractiveness, and (ii) a qualitative linguistic analysis examining the textual characteristics of stories produced by each model. In the first experiment, BART-large outscored average human writers overall (2.11 vs. 1.85), a 14% relative improvement, though the slight human advantage in creativity was not statistically significant. In the second experiment, qualitative analysis showed that while GPT-4o demonstrated near-perfect coherence and used fewer clichéd phrases, it tended to produce more predictable language, with only 3% of its synopses featuring surprising associations (compared to 15% for BART). These findings highlight how model size and fine-tuning influence the balance between creativity, fluency, and coherence in creative writing tasks, and demonstrate that smaller models can, in certain contexts, rival both humans and larger models.
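The 14% figure is a relative difference between the two overall scores, not a gap in percentage points. A minimal sketch of that arithmetic follows; the function name `relative_improvement` is hypothetical and chosen only for illustration, and the two scores are the ones quoted in the abstract.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Return (new - baseline) / baseline as a fraction of the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


# Overall scores quoted in the abstract.
bart_score = 2.11   # BART-large
human_score = 1.85  # average human writers

gain = relative_improvement(bart_score, human_score)
print(f"{gain:.1%}")  # prints 14.1%
```

The absolute gap is 0.26 points; dividing by the human baseline of 1.85 gives roughly 0.14, which matches the reported 14%.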

