CL AI HC · Sep 26, 2025

Death of the Novel(ty): Beyond n-Gram Novelty as a Metric for Textual Creativity

Arkadiy Saakyan, Najoung Kim, Smaranda Muresan, Tuhin Chakrabarty

arXiv:2509.22641v1 · 2 citations · h-index: 36 · Has Code

Originality: Incremental advance

AI Analysis

This work examines whether n-gram novelty is an adequate measure of creativity in generated text, a question relevant to researchers and practitioners in NLP and AI. The advance is incremental because it builds on existing theoretical critiques.

The paper evaluates n-gram novelty as a metric for textual creativity by analyzing 7542 annotations from 26 expert writers on human and AI-generated text. It finds that n-gram novelty is positively associated with creativity, yet about 91% of top-quartile expressions by n-gram novelty were not judged creative, and that higher n-gram novelty in open-source LLMs correlates with lower pragmaticality.

N-gram novelty is widely used to evaluate language models' ability to generate text outside of their training data. More recently, it has also been adopted as a metric for measuring textual creativity. However, theoretical work on creativity suggests that this approach may be inadequate, as it does not account for creativity's dual nature: novelty (how original the text is) and appropriateness (how sensical and pragmatic it is). We investigate the relationship between this notion of creativity and n-gram novelty through 7542 expert writer annotations (n=26) of novelty, pragmaticality, and sensicality via close reading of human and AI-generated text. We find that while n-gram novelty is positively associated with expert writer-judged creativity, ~91% of top-quartile expressions by n-gram novelty are not judged as creative, cautioning against relying on n-gram novelty alone. Furthermore, unlike human-written text, higher n-gram novelty in open-source LLMs correlates with lower pragmaticality. In an exploratory study with frontier closed-source models, we additionally confirm that they are less likely to produce creative expressions than humans. Using our dataset, we test whether zero-shot, few-shot, and finetuned models are able to identify creative expressions (a positive aspect of writing) and non-pragmatic ones (a negative aspect). Overall, frontier LLMs perform much better than random but leave room for improvement, struggling especially to identify non-pragmatic expressions. We further find that LLM-as-a-Judge novelty scores from the best-performing model were predictive of expert writer preferences.
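
For readers unfamiliar with the metric, the sketch below shows one common way to compute n-gram novelty: the fraction of a text's n-grams that do not occur in a reference corpus. It is a minimal illustration under stated assumptions, not the paper's implementation. The function names, the whitespace tokenization, the lowercasing, and the in-memory set index are all hypothetical choices; the paper's exact definition and reference corpus may differ.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> List[NGram]:
    """Return all contiguous n-grams of a token list (empty if too short)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_reference_index(corpus: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram seen in the reference corpus.

    Assumption: naive lowercased whitespace tokenization, for illustration only.
    """
    index: Set[NGram] = set()
    for doc in corpus:
        index.update(ngrams(doc.lower().split(), n))
    return index


def ngram_novelty(text: str, reference_index: Set[NGram], n: int) -> float:
    """Fraction of the text's n-grams that are absent from the reference index."""
    grams = ngrams(text.lower().split(), n)
    if not grams:
        # Text shorter than n tokens: no n-grams to score.
        return 0.0
    unseen = sum(1 for g in grams if g not in reference_index)
    return unseen / len(grams)


if __name__ == "__main__":
    reference = build_reference_index(["the cat sat on the mat"], n=2)
    # Bigrams "on a" and "a mat" are unseen, so 2 of 5 bigrams are novel.
    print(ngram_novelty("the cat sat on a mat", reference, n=2))
```

A score computed this way rewards any unseen word sequence, including nonsensical ones, which is consistent with the abstract's caution that high n-gram novelty alone does not imply expert-judged creativity.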
