The Unlikely Duel: Evaluating Creative Writing in LLMs through a Unique Scenario
The paper addresses the problem of assessing LLM creativity, which matters to researchers and developers, though its contribution is incremental: it applies existing evaluation methods to a new scenario.
The paper evaluated state-of-the-art LLMs on a creative writing task using a unique prompt to avoid data leakage, finding that some commercial LLMs matched or slightly outperformed human writers in most dimensions, while open-source models lagged behind.
This is a summary of the paper "A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing", which was published in Findings of EMNLP 2023.
We evaluate a range of recent state-of-the-art, instruction-tuned large language models (LLMs) on an English creative writing task, and compare them to human writers. For this purpose, we use a specifically tailored prompt (based on an epic combat between Ignatius J. Reilly, main character of John Kennedy Toole's "A Confederacy of Dunces", and a pterodactyl) to minimize the risk of training data leakage and force the models to be creative rather than reusing existing stories. The same prompt is presented to LLMs and human writers, and evaluation is performed by humans using a detailed rubric covering aspects such as fluency, style, originality, and humor. Results show that some state-of-the-art commercial LLMs match or slightly outperform our human writers in most of the evaluated dimensions. Open-source LLMs lag behind. Humans keep a close lead in originality, and only the top three LLMs can handle humor at human-like levels.
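To make the rubric-based comparison concrete, here is a minimal Python sketch of how per-dimension ratings from several human raters could be averaged for each writer. It is an illustration only, not the paper's actual analysis pipeline: the function name `mean_scores`, the writer labels, the score scale, and every number in `toy_ratings` are hypothetical placeholders and do not come from the paper.

```python
from collections import defaultdict
from statistics import mean


def mean_scores(ratings):
    """Average rubric scores per (writer, dimension) pair.

    `ratings` is an iterable of (writer, dimension, score) tuples,
    one tuple per individual rater judgment.
    """
    buckets = defaultdict(list)
    for writer, dimension, score in ratings:
        buckets[(writer, dimension)].append(score)
    return {key: mean(values) for key, values in buckets.items()}


# Hypothetical placeholder ratings; NOT data from the paper.
toy_ratings = [
    ("writer_a", "humor", 6),
    ("writer_a", "humor", 8),
    ("writer_b", "humor", 3),
    ("writer_b", "humor", 5),
    ("writer_a", "originality", 7),
]

scores = mean_scores(toy_ratings)
for (writer, dimension), value in sorted(scores.items()):
    print(f"{writer:10s} {dimension:12s} {value}")
```

Keeping the scores keyed by dimension, rather than collapsing them into one overall number, preserves the kind of distinction the summary draws, such as a writer doing well on fluency while trailing on originality or humor.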