CL | AI | CV | LG | Apr 27, 2025

VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs?

arXiv:2504.19267v3 | 13 citations | h-index: 98
Originality: Incremental advance
AI Analysis

This work addresses the generation of coherent, visually grounded stories from image sequences, with applications in AI and human-computer interaction. It is assessed as an incremental advance in the field.

The paper tackles visual storytelling, generating narratives from image sequences with multimodal models. It evaluates the output with the reference-free metrics RoViST and GROOVIST, which are reported to align more closely with human judgment than traditional metrics.

Visual storytelling is an interdisciplinary field combining computer vision and natural language processing to generate cohesive narratives from sequences of images. This paper presents a novel approach that leverages recent advancements in multimodal models, specifically adapting transformer-based architectures and large multimodal models, for the visual storytelling task. Leveraging the large-scale Visual Storytelling (VIST) dataset, our VIST-GPT model produces visually grounded, contextually appropriate narratives. We address the limitations of traditional evaluation metrics, such as BLEU, METEOR, ROUGE, and CIDEr, which are not suitable for this task. Instead, we utilize RoViST and GROOVIST, novel reference-free metrics designed to assess visual storytelling, focusing on visual grounding, coherence, and non-redundancy. These metrics provide a more nuanced evaluation of narrative quality, aligning closely with human judgment.
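To make the evaluation argument concrete, the sketch below illustrates why n-gram overlap can penalize a perfectly reasonable story that happens to use different words than the reference. It computes clipped unigram precision (the building block of BLEU) and a simple repeated-bigram rate. This is an illustration only: the function names and example sentences are hypothetical, and the repetition rate is a rough stand-in for the idea of non-redundancy, not the RoViST or GROOVIST implementation.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference,
    with each n-gram's count clipped to its count in the reference."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def repetition_rate(story, n=2):
    """Share of n-grams that repeat an earlier n-gram in the same story
    (0.0 means no repetition). A crude proxy for redundancy only."""
    grams = ngrams(story.lower().split(), n)
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


# Hypothetical example sentences, not taken from the VIST dataset.
reference = "the family went to the beach and played in the sand"
paraphrase = "we spent the afternoon by the ocean building castles"

# The paraphrase describes a similar scene but shares almost no words
# with the reference, so its overlap score is low.
print(clipped_precision(paraphrase, reference))
print(repetition_rate("a b a b a b"))
```

A reference-free metric avoids this dependence on a single human-written story by scoring the generated text against the images and against itself, which is the motivation the abstract gives for moving away from BLEU-style scores.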

