CL AI Feb 2

Hallucination or Creativity: How to Evaluate AI-Generated Scientific Stories?

Alex Argese, Pasquale Lisena, Raphaël Troncy

arXiv:2602.02290v1 0.6 h-index: 13

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of evaluating AI-generated scientific narratives, which matters to researchers and educators. The contribution is incremental: it builds on existing metrics and adds a new framework around them.

The paper tackles the evaluation of AI-generated scientific stories, which is difficult because such stories require abstraction and creativity. It proposes StoryScore, a composite metric that integrates components such as semantic alignment and hallucination detection, and it shows that existing methods often fail to distinguish creativity from factual errors.

Generative AI can turn scientific articles into narratives for diverse audiences, but evaluating these stories remains challenging. Storytelling demands abstraction, simplification, and pedagogical creativity, qualities that standard summarization metrics often fail to capture. Meanwhile, factual hallucinations are critical in scientific contexts, yet detectors often misclassify legitimate narrative reformulations or prove unstable when creativity is involved. In this work, we propose StoryScore, a composite metric for evaluating AI-generated scientific stories. StoryScore integrates semantic alignment, lexical grounding, narrative control, structural fidelity, redundancy avoidance, and entity-level hallucination detection into a unified framework. Our analysis also reveals why many hallucination detection methods fail to distinguish pedagogical creativity from factual errors, highlighting a key limitation: while automatic metrics can effectively assess semantic similarity to the original content, they struggle to evaluate how it is narrated and controlled.
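
To make the idea of a composite metric concrete, the sketch below combines six component scores into a single number with a weighted mean. The abstract names the six components but does not say how StoryScore aggregates them, so the weighted mean, the [0, 1] score range, the equal default weights, and every identifier here (`ComponentScores`, `composite_score`, `hallucination_free`) are assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional


@dataclass(frozen=True)
class ComponentScores:
    """Six component scores named in the abstract.

    Assumption: each score is normalized to [0, 1], higher is better.
    """
    semantic_alignment: float
    lexical_grounding: float
    narrative_control: float
    structural_fidelity: float
    redundancy_avoidance: float
    # Assumption: 1.0 means no entity-level hallucination was detected.
    hallucination_free: float


def composite_score(
    scores: ComponentScores,
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted mean of the component scores (hypothetical aggregation rule)."""
    names = [f.name for f in fields(scores)]
    if weights is None:
        # Assumption: equal weights when none are supplied.
        weights = {name: 1.0 for name in names}
    if set(weights) != set(names):
        raise ValueError("weights must cover exactly the six components")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    for name in names:
        value = getattr(scores, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    weighted_sum = sum(weights[name] * getattr(scores, name) for name in names)
    return weighted_sum / total_weight


if __name__ == "__main__":
    story = ComponentScores(
        semantic_alignment=0.9,
        lexical_grounding=0.8,
        narrative_control=0.6,
        structural_fidelity=0.7,
        redundancy_avoidance=0.95,
        hallucination_free=0.5,
    )
    print(round(composite_score(story), 3))
```

A weighted mean keeps each component's contribution visible, but it also lets a high semantic-alignment score mask a poor hallucination score. A different aggregation, such as a minimum or a multiplicative penalty, would treat that trade-off differently.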
