CL AI | Oct 23, 2025

CreativityPrism: A Holistic Benchmark for Large Language Model Creativity

Zhaoyi Joey Hou, Bowei Alvin Zhang, Yining Lu, Bhiman Kumar Baghel, Anneliese Brei, Ximing Lu, Meng Jiang, Faeze Brahman, Snigdha Chaturvedi, Haw-Shiuan Chang, Daniel Khashabi, Xiang Lorraine Li

arXiv:2510.20091v1 | 13.06 citations | h-index: 15 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for a comprehensive evaluation framework for LLM creativity. The advance is incremental: it builds on existing fragmented methods by integrating them into a single structured benchmark.

The authors tackle the problem of evaluating creativity in large language models by proposing CreativityPrism, a holistic benchmark that decomposes creativity into three dimensions (quality, novelty, and diversity) across nine tasks and three domains. Their evaluation reveals a notable performance gap between proprietary and open-source models and shows that strong performance in one creativity dimension does not necessarily generalize to others.

Creativity is often seen as a hallmark of human intelligence. While large language models (LLMs) are increasingly perceived as producing creative text, there is still no holistic framework to evaluate their creativity across diverse scenarios. Existing evaluation methods remain fragmented, with dramatic variation across domains and tasks, largely due to differing definitions and measurements of creativity. Inspired by the hypothesis that creativity is not one fixed idea, we propose CreativityPrism, an evaluation analysis framework that decomposes creativity into three dimensions: quality, novelty, and diversity. CreativityPrism incorporates nine tasks, three domains, i.e., divergent thinking, creative writing, and logical reasoning, and twenty evaluation metrics, which measure each dimension in task-specific, unique ways. We evaluate 17 state-of-the-art (SoTA) proprietary and open-sourced LLMs on CreativityPrism and analyze the performance correlations among different metrics and task domains. Our results reveal a notable gap between proprietary and open-source models. Overall, model performance tends to be highly correlated across tasks within the same domain and less so across different domains. Among evaluation dimensions, diversity and quality metrics show strong correlations - models that perform well on one often excel on the other - whereas novelty exhibits much weaker correlation with either. These findings support our hypothesis that strong performance in one creativity task or dimension does not necessarily generalize to others, underscoring the need for a holistic evaluation of LLM creativity.
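The cross-dimension correlation analysis described in the abstract can be illustrated with a minimal sketch. The abstract does not say which correlation coefficient the authors use, so Spearman rank correlation is assumed here. The helper names (`ranks`, `pearson`, `spearman`) and every score in the `scores` table are hypothetical placeholders, not values or code from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-model scores for five made-up models.
# These are illustrative placeholders, NOT results from the paper.
scores = {
    "quality":   [0.2, 0.4, 0.5, 0.7, 0.9],
    "diversity": [0.1, 0.3, 0.6, 0.8, 0.7],
    "novelty":   [0.5, 0.1, 0.9, 0.3, 0.4],
}

dims = list(scores)
for i, a in enumerate(dims):
    for b in dims[i + 1:]:
        print(f"{a} vs {b}: rho = {spearman(scores[a], scores[b]):.2f}")
```

Each list holds one score per model for a given dimension, so a high coefficient between two dimensions means that models ranked well on one tend to rank well on the other. A real analysis would replace the placeholder table with the benchmark's aggregated per-model scores.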
