CV · CL · May 15, 2025

StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation

arXiv:2505.10292v2 · 2 citations · h-index: 2
Originality: Incremental advance
AI Analysis

This paper addresses referential hallucinations in AI visual storytelling systems. The advance is incremental because it builds on existing chain-of-thought and grounding methods.

Visual storytelling systems struggle with character identity consistency and referential hallucinations. The paper introduces the StoryReasoning dataset, 4,178 stories derived from movie images, each with a structured scene analysis and a grounded story. It shows that fine-tuning Qwen2.5-VL 7B reduces the average number of hallucinations per story by 12.3% and improves the creativity score by 31.0% relative to the non-fine-tuned model.

Visual storytelling systems struggle to maintain character identity across frames and link actions to appropriate subjects, frequently leading to referential hallucinations. These issues can be addressed through grounding of characters, objects, and other entities on the visual elements. We propose StoryReasoning, a dataset containing 4,178 stories derived from 52,016 movie images, with both structured scene analyses and grounded stories. Each story maintains character and object consistency across frames while explicitly modeling multi-frame relationships through structured tabular representations. Our approach features cross-frame object re-identification using visual similarity and face recognition, chain-of-thought reasoning for explicit narrative modeling, and a grounding scheme that links textual elements to visual entities across multiple frames. We establish baseline performance by fine-tuning Qwen2.5-VL 7B, creating Qwen Storyteller, which performs end-to-end object detection, re-identification, and landmark detection while maintaining consistent object references throughout the story. Evaluation demonstrates a reduction from 4.06 to 3.56 (-12.3%) hallucinations on average per story and an improvement in creativity from 2.58 to 3.38 (+31.0%) when compared to a non-fine-tuned model.
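The percentages quoted in the abstract are relative changes against the non-fine-tuned baseline. A minimal sketch of that arithmetic follows. The function name `relative_change` is a hypothetical helper introduced here for illustration and is not taken from the paper's code.

```python
def relative_change(before: float, after: float) -> float:
    """Return the percent change from `before` to `after` (hypothetical helper)."""
    return (after - before) / before * 100.0


# Figures reported in the abstract (baseline model -> fine-tuned model).
hallucination_change = relative_change(4.06, 3.56)  # avg. hallucinations per story
creativity_change = relative_change(2.58, 3.38)     # creativity score

print(f"Hallucinations: {hallucination_change:+.1f}%")  # Hallucinations: -12.3%
print(f"Creativity: {creativity_change:+.1f}%")         # Creativity: +31.0%
```

Both results match the rounded figures in the abstract: a drop of 0.50 hallucinations per story against a baseline of 4.06, and a gain of 0.80 creativity points against a baseline of 2.58.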

Code Implementations: 1 repo
