GENIE: A Fine-Grained Measure for Novelty
For researchers and practitioners evaluating LLM creativity, GENIE offers a more precise tool to diagnose novelty deficiencies and assess mitigation methods.
The paper proposes GENIE, a fine-grained evaluation metric that measures the novelty of LLM-generated responses along task-specific features, and shows that it outperforms holistic metrics at capturing the multi-dimensional nature of novelty while providing actionable insights.
Large language models (LLMs) have consistently demonstrated a lack of creativity and diversity across tasks. Prior work has focused on whether models are capable of generating creative outputs. Here, we focus on novelty and investigate, in a task-specific manner, what makes model-generated content novel or not. We propose GENIE, a fine-grained evaluation metric that measures the novelty of a response along task-specific features with respect to a population of responses. We show that, unlike GENIE, holistic metrics struggle to capture the high dimensionality of novelty and provide no insight into which properties they target. Finally, we use GENIE to measure the effectiveness of mitigation methods that address creativity, to better understand where these methods can improve novelty.
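To make the idea of feature-level novelty relative to a population concrete, here is a minimal sketch. It is not GENIE's actual scoring rule, which this summary does not specify. It assumes each response has already been annotated with categorical task-specific features, and it scores a feature value by how rare it is in the population. The function name `feature_novelty`, the feature names, and the rarity formula are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Dict, List

# Assumption: each response is represented by categorical, task-specific
# features that were extracted beforehand (e.g. by an annotator or a model).
Features = Dict[str, str]


def feature_novelty(response: Features, population: List[Features]) -> Dict[str, float]:
    """Illustrative per-feature novelty score (hypothetical, not GENIE's formula).

    For each feature, the score is 1 minus the fraction of population
    responses that share the same value, so rarer values score higher.
    """
    if not population:
        raise ValueError("population must be non-empty")
    scores: Dict[str, float] = {}
    for name, value in response.items():
        counts = Counter(p.get(name) for p in population)
        scores[name] = 1.0 - counts[value] / len(population)
    return scores


# Toy population of story responses; the candidate is one of its members.
population = [
    {"setting": "forest", "protagonist": "knight"},
    {"setting": "forest", "protagonist": "wizard"},
    {"setting": "forest", "protagonist": "knight"},
    {"setting": "space station", "protagonist": "knight"},
]
candidate = {"setting": "space station", "protagonist": "knight"}

scores = feature_novelty(candidate, population)
# A single holistic number, here the plain mean of the per-feature scores.
holistic = sum(scores.values()) / len(scores)
```

In this toy example the candidate's setting is rare in the population while its protagonist is common, so the per-feature scores differ sharply. Averaging them into one holistic number hides which property is responsible, which is the kind of information a fine-grained metric is meant to preserve.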