CV, AI | Feb 1, 2024

SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling

Eileen Wang, Soyeon Caren Han, Josiah Poon

arXiv:2402.00319v134.6104 citations | h-index: 26 | EACL

Originality: Incremental advance

AI Analysis

This work aims to make visual storytelling more engaging and human-like for AI-generated content applications. The advance is incremental, as it builds on existing graph-based methods.

The paper tackles the generation of coherent visual stories from image sequences by incorporating social interaction commonsense knowledge. The resulting framework outperforms existing methods on visual grounding, coherence, diversity, and humanness in both automatic and human evaluations.

Visual storytelling aims to automatically generate a coherent story from a given image sequence. Unlike tasks such as image captioning, visual stories should contain factual descriptions, worldviews, and human social commonsense that tie disjointed elements together into a coherent and engaging human-writeable story. However, most models focus mainly on applying factual information and using taxonomic/lexical external knowledge when attempting to create stories. This paper introduces SCO-VIST, a framework that represents the image sequence as a graph of objects and relations, including human action motivation and its social interaction commonsense knowledge. SCO-VIST then takes this graph of plot points and creates bridges between plot points with semantic and occurrence-based edge weights. The weighted story graph produces the storyline as a sequence of events using the Floyd-Warshall algorithm. Our proposed framework produces stories that are superior across multiple metrics in terms of visual grounding, coherence, diversity, and humanness, per both automatic and human evaluations.
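
To make the last step of the abstract concrete, here is a minimal Python sketch of the Floyd-Warshall algorithm applied to a small weighted graph of plot points. It is not the authors' implementation. The plot-point names, the integer edge costs, and the helper names `floyd_warshall` and `storyline` are hypothetical. The sketch also assumes that a lower edge cost means a stronger link between two plot points and that the storyline is the cheapest path from the first plot point to the last; the abstract does not spell out either detail.

```python
from math import inf


def floyd_warshall(weights):
    """All-pairs shortest paths plus a next-hop table for path recovery.

    weights: dict mapping a directed edge (u, v) to a non-negative cost.
    """
    nodes = sorted({n for edge in weights for n in edge})
    dist = {u: {v: (0 if u == v else inf) for v in nodes} for u in nodes}
    nxt = {u: {v: None for v in nodes} for u in nodes}

    # Seed the tables with the direct edges.
    for (u, v), w in weights.items():
        if w < dist[u][v]:
            dist[u][v] = w
            nxt[u][v] = v

    # Allow each node k in turn to act as an intermediate stop.
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    return dist, nxt


def storyline(nxt, start, end):
    """Follow next-hop pointers from start to end; [] if unreachable."""
    if start != end and nxt[start][end] is None:
        return []
    path = [start]
    while path[-1] != end:
        path.append(nxt[path[-1]][end])
    return path


# Hypothetical plot points with made-up costs (assumption: lower = stronger link).
edges = {
    ("arrive", "greet"): 2,
    ("greet", "eat"): 3,
    ("arrive", "eat"): 9,
    ("eat", "leave"): 4,
    ("greet", "leave"): 10,
}
dist, nxt = floyd_warshall(edges)
events = storyline(nxt, "arrive", "leave")
print(events, dist["arrive"]["leave"])
```

In this toy graph the direct hop from "arrive" to "eat" costs more than the detour through "greet", so the recovered event sequence passes through all four plot points. Floyd-Warshall computes the cheapest path between every pair of plot points at once, which suits small story graphs where any two events may need to be bridged.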
