Beyond Divergent Creativity: A Human-Based Evaluation of Creativity in Large Language Models
This work addresses the weak theoretical grounding of creativity evaluation for large language models, offering AI researchers and practitioners a more valid benchmark.
The paper addresses the evaluation of creativity in large language models. It shows that the widely used Divergent Association Task (DAT) yields invalid results, with LLMs scoring lower than non-creative baselines, and introduces the Conditional Divergent Association Task (CDAT), which measures novelty conditional on appropriateness. Under CDAT, smaller model families often show the most creativity, while advanced families favor appropriateness at lower novelty.
Large language models (LLMs) are increasingly used in verbal creative tasks. However, previous assessments of the creative capabilities of LLMs remain weakly grounded in human creativity theory and are thus hard to interpret. The widely used Divergent Association Task (DAT) focuses on novelty and ignores appropriateness, a core component of creativity. We evaluate a range of state-of-the-art LLMs on DAT and show that their scores are lower than those of two baselines that do not possess any creative abilities, which undermines the validity of DAT for model evaluation. Grounded in human creativity theory, which defines creativity as the combination of novelty and appropriateness, we introduce the Conditional Divergent Association Task (CDAT). CDAT evaluates novelty conditional on contextual appropriateness, separating noise from creativity better than DAT while remaining simple and objective. Under CDAT, smaller model families often show the most creativity, whereas advanced families favor appropriateness at lower novelty. We hypothesize that training and alignment shift models along this novelty-appropriateness frontier, making outputs more appropriate but less creative. We release the dataset and code.
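To make the contrast between the two tasks concrete, here is a minimal sketch. It assumes the usual DAT scoring (mean pairwise cosine distance between word embeddings, scaled by 100). The abstract does not give the CDAT formula, so `conditional_novelty`, the cue word, the `min_similarity` threshold, and the toy embeddings are hypothetical illustrations and not the authors' method.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dat_score(words, embeddings):
    """DAT-style novelty: mean pairwise cosine distance, scaled by 100."""
    pairs = list(combinations(words, 2))
    total = sum(cosine_distance(embeddings[a], embeddings[b]) for a, b in pairs)
    return 100.0 * total / len(pairs)


def conditional_novelty(cue, words, embeddings, min_similarity=0.2):
    """Hypothetical conditional score (an assumption, not the paper's CDAT).

    Keeps only words whose similarity to the cue reaches `min_similarity`
    (a stand-in for appropriateness), then scores novelty over the survivors.
    """
    appropriate = [
        w for w in words
        if 1.0 - cosine_distance(embeddings[cue], embeddings[w]) >= min_similarity
    ]
    if len(appropriate) < 2:
        return 0.0, appropriate  # too few appropriate words to measure novelty
    return dat_score(appropriate, embeddings), appropriate


# Toy 2-D embeddings, invented for illustration only.
TOY_EMBEDDINGS = {
    "north": (1.0, 0.0),
    "east": (0.0, 1.0),
    "northeast": (1.0, 1.0),
}

if __name__ == "__main__":
    words = ["north", "east", "northeast"]
    print(dat_score(words, TOY_EMBEDDINGS))
    print(conditional_novelty("north", words, TOY_EMBEDDINGS))
```

In this sketch, a word unrelated to the cue raises the unconditional score but is dropped before the conditional score is computed. That mirrors the abstract's point that noise can score well on novelty alone.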