StoryMovie: A Dataset for Semantic Alignment of Visual Stories with Movie Scripts and Subtitles
StoryMovie addresses incorrect dialogue attribution and character interactions in visual storytelling, a problem of interest to AI researchers. The contribution is incremental, since it builds on prior work in visual grounding and entity re-identification.
The authors tackle the problem of visual storytelling models hallucinating semantic relationships by introducing StoryMovie, a dataset of 1,757 stories aligned with movie scripts and subtitles. They fine-tune Qwen Storyteller3 on this dataset. With DeepSeek V3 as judge, Storyteller3 achieves an 89.9% win rate against base Qwen2.5-VL 7B on subtitle alignment, and 48.5% versus 38.0% on dialogue attribution against Storyteller, which was trained without script grounding.
Visual storytelling models that correctly ground entities in images may still hallucinate semantic relationships, generating incorrect dialogue attribution, character interactions, or emotional states. We introduce StoryMovie, a dataset of 1,757 stories aligned with movie scripts and subtitles through LCS matching. Our alignment pipeline synchronizes screenplay dialogue with subtitle timestamps, enabling dialogue attribution by linking character names from scripts to temporal positions from subtitles. Using this aligned content, we generate stories that maintain visual grounding tags while incorporating authentic character names, dialogue, and relationship dynamics. We fine-tune Qwen Storyteller3 on this dataset, building on prior work in visual grounding and entity re-identification. Evaluation using DeepSeek V3 as judge shows that Storyteller3 achieves an 89.9% win rate against base Qwen2.5-VL 7B on subtitle alignment. Compared to Storyteller, trained without script grounding, Storyteller3 achieves 48.5% versus 38.0%, confirming that semantic alignment progressively improves dialogue attribution beyond visual grounding alone.
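The alignment step described in the abstract can be illustrated with a small sketch. It assumes that "LCS" refers to longest common subsequence, that matching is done at the level of whole dialogue lines, and that two lines count as equal when their normalized text is sufficiently similar. The function names (`normalize`, `similar`, `lcs_align`), the 0.8 threshold, and the sample dialogue are hypothetical and chosen only for illustration; the authors' actual pipeline may differ in granularity and matching criteria.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and drop punctuation so formatting differences do not block a match."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def similar(a, b, threshold=0.8):
    """Fuzzy equality between a script line and a subtitle line (threshold is an assumption)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def lcs_align(script, subtitles, threshold=0.8):
    """Align script lines to subtitle entries with a longest-common-subsequence pass.

    script:    list of (speaker, line) in screenplay order
    subtitles: list of (start_sec, end_sec, text) in playback order
    Returns a list of (speaker, start_sec, end_sec, subtitle_text) for matched pairs.
    """
    n, m = len(script), len(subtitles)
    match = [[similar(script[i][1], subtitles[j][2], threshold) for j in range(m)]
             for i in range(n)]

    # dp[i][j] = size of the best order-preserving matching between script[i:] and subtitles[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            best = max(dp[i + 1][j], dp[i][j + 1])
            if match[i][j]:
                best = max(best, 1 + dp[i + 1][j + 1])
            dp[i][j] = best

    # Walk the table to recover the matched pairs in order.
    aligned, i, j = [], 0, 0
    while i < n and j < m:
        if match[i][j] and dp[i][j] == 1 + dp[i + 1][j + 1]:
            start, end, text = subtitles[j]
            aligned.append((script[i][0], start, end, text))
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1  # script line has no subtitle counterpart (e.g. a cut line)
        else:
            j += 1  # subtitle has no script counterpart (e.g. a sound cue)
    return aligned


# Hypothetical sample data, not taken from the dataset.
script = [
    ("ANNA", "Where were you last night?"),
    ("BEN", "I was at the station."),
    ("ANNA", "Don't lie to me, Ben."),
]
subtitles = [
    (10.0, 11.5, "Where were you last night"),
    (11.8, 12.4, "[door slams]"),
    (13.0, 14.2, "Dont lie to me, Ben!"),
]
aligned = lcs_align(script, subtitles)
for speaker, start, end, text in aligned:
    print(f"{start:6.1f}-{end:6.1f}  {speaker}: {text}")
```

Each matched pair carries the speaker name from the script and the timestamps from the subtitle, which is the linkage the abstract describes for dialogue attribution. Script lines without a subtitle counterpart, and subtitles without a script counterpart, are simply skipped in this sketch.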