AI · CL · CV · LG · Oct 26, 2023

GROOViST: A Metric for Grounding Objects in Visual Storytelling

arXiv:2310.17770v1135 citations · h-index: 12 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for better evaluation tools in visual storytelling and is most relevant to researchers and practitioners working on multimodal AI. It is an incremental advance, since it builds on prior metrics.

The authors tackle the problem of evaluating visual grounding in visual storytelling by proposing GROOViST, a novel metric that accounts for cross-modal dependencies and temporal misalignments and that aligns better with human judgments than existing metrics.

A proper evaluation of stories generated for a sequence of images -- the task commonly referred to as visual storytelling -- must consider multiple aspects, such as coherence, grammatical correctness, and visual grounding. In this work, we focus on evaluating the degree of grounding, that is, the extent to which a story is about the entities shown in the images. We analyze current metrics, both designed for this purpose and for general vision-text alignment. Given their observed shortcomings, we propose a novel evaluation tool, GROOViST, that accounts for cross-modal dependencies, temporal misalignments (the fact that the order in which entities appear in the story and the image sequence may not match), and human intuitions on visual grounding. An additional advantage of GROOViST is its modular design, where the contribution of each component can be assessed and interpreted individually.
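
To make the notion of temporal misalignment concrete, the toy sketch below scores grounding in two ways: by matching each story noun phrase only against regions of the image at the same position in the sequence, and by matching it against regions from any image in the sequence. This is an illustration only, not the GROOViST implementation; the function name, its parameters, and the similarity values are all hypothetical, and a real system would obtain similarities from a vision-language model.

```python
from typing import Sequence


def grounding_score(
    np_region_sim: Sequence[Sequence[float]],
    np_image_idx: Sequence[int],
    region_image_idx: Sequence[int],
    order_agnostic: bool = True,
) -> float:
    """Mean best-match similarity between story noun phrases and image regions.

    All inputs are hypothetical, precomputed values:
      np_region_sim[i][j]  similarity between noun phrase i and region j
      np_image_idx[i]      sequence position of the sentence containing phrase i
      region_image_idx[j]  sequence position of the image containing region j
    """
    if not np_region_sim:
        return 0.0
    best = []
    for i, row in enumerate(np_region_sim):
        if order_agnostic:
            # Allow a match with a region from any image in the sequence.
            candidates = list(row)
        else:
            # Only allow regions from the image aligned with this sentence.
            candidates = [
                s for j, s in enumerate(row)
                if region_image_idx[j] == np_image_idx[i]
            ]
        best.append(max(candidates, default=0.0))
    return sum(best) / len(best)


# Toy case: "dog" is mentioned in sentence 0 but shown in image 1,
# and "park" is mentioned in sentence 1 but shown in image 0.
sim = [[0.1, 0.9],   # "dog"  vs. (region in image 0, region in image 1)
       [0.8, 0.2]]   # "park" vs. (region in image 0, region in image 1)
np_pos = [0, 1]
region_pos = [0, 1]

strict = grounding_score(sim, np_pos, region_pos, order_agnostic=False)
relaxed = grounding_score(sim, np_pos, region_pos, order_agnostic=True)
print(f"strict={strict:.2f} relaxed={relaxed:.2f}")
```

In this toy case the position-aligned score is low (about 0.15) even though both entities are clearly shown somewhere in the sequence, while the order-agnostic score is high (about 0.85). That gap is the kind of effect a metric has to handle when the order of entities in the story and in the images does not match.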

Code Implementations: 1 repo
