AI · Nov 17, 2025

CreBench: Human-Aligned Creativity Evaluation from Idea to Process to Product

arXiv:2511.13626v1 · 3 citations · h-index: 4 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the lack of a benchmark for creativity evaluation, an abstract quality that researchers and developers of MLLMs have had no standard way to measure. The advance is incremental, as it builds on existing MLLM frameworks.

The authors tackled the challenge of evaluating creativity in multimodal large language models (MLLMs) by introducing CreBench, a benchmark and dataset for human-aligned creativity assessment, and CreExpert, a fine-tuned model that significantly outperforms state-of-the-art MLLMs like GPT-4V and Gemini-Pro-Vision in alignment with human judgments.

Human-defined creativity is highly abstract, posing a challenge for multimodal large language models (MLLMs) to comprehend and assess creativity in a way that aligns with human judgments. The absence of an existing benchmark further exacerbates this dilemma. To this end, we propose CreBench, which consists of two key components: 1) an evaluation benchmark covering multiple dimensions, from creative idea to process to product; 2) CreMIT (Creativity Multimodal Instruction Tuning dataset), a multimodal creativity evaluation dataset consisting of 2.2K diverse-sourced multimodal data, 79.2K pieces of human feedback, and 4.7M instructions of multiple types. Specifically, to ensure MLLMs can handle diverse creativity-related queries, we prompt GPT to refine this human feedback to activate stronger creativity assessment capabilities. CreBench serves as a foundation for building MLLMs that understand human-aligned creativity. Based on CreBench, we fine-tune open-source general MLLMs, resulting in CreExpert, a multimodal creativity evaluation expert model. Extensive experiments demonstrate that the proposed CreExpert models achieve significantly better alignment with human creativity evaluation than state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision.

