CV Dec 16, 2023

Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos

Mingfei Han, Linjie Yang, Xiaojun Chang, Lina Yao, Heng Wang

arXiv:2312.10300v322.055 citations h-index: 11 Has Code

Originality Synthesis-oriented

AI Analysis

This work addresses comprehensive video understanding by giving researchers a new benchmark, though the contribution is incremental because it builds on existing video understanding tasks.

The authors introduced Shot2Story, a benchmark for multi-shot video understanding that includes shot-level captions, video summaries, and QA pairs, and found that even imperfect summaries from their benchmark can achieve competitive performance on existing video QA tasks.

A short video clip may contain a progression of multiple events and an interesting storyline. To understand the story behind it, a human needs to capture the event in every shot and associate the events with one another. In this work, we present Shot2Story, a new multi-shot video understanding benchmark with detailed shot-level captions, comprehensive video summaries, and question-answering pairs. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks, including single-shot video captioning, multi-shot video summarization, and multi-shot video question answering. Preliminary experiments reveal challenges in generating a long and comprehensive video summary for multi-shot videos. Nevertheless, the generated imperfect summaries can already achieve competitive performance on existing video understanding tasks such as video question answering, promoting an under-explored setting of video understanding with detailed summaries.
