CV · AI · Nov 2, 2023

Long Story Short: a Summarize-then-Search Method for Long Video Question Answering

arXiv:2311.01233v1 · 7 citations · h-index: 7
Originality: Highly original
AI Analysis

This paper addresses the scarcity of supervision data for diverse narrative tasks in multimedia content by offering a zero-shot approach to long video QA.

It tackles zero-shot question answering on long multimodal narratives such as movies by proposing a framework that first summarizes the video's plot and then searches for the parts relevant to the question. The authors report that it outperforms state-of-the-art supervised models by a large margin.

Large language models such as GPT-3 have demonstrated an impressive capability to adapt to new tasks without requiring task-specific training data. This capability has been particularly effective in settings such as narrative question answering, where the diversity of tasks is immense, but the available supervision data is small. In this work, we investigate if such language models can extend their zero-shot reasoning abilities to long multimodal narratives in multimedia content such as drama, movies, and animation, where the story plays an essential role. We propose Long Story Short, a framework for narrative video QA that first summarizes the narrative of the video to a short plot and then searches parts of the video relevant to the question. We also propose to enhance visual matching with CLIPCheck. Our model outperforms state-of-the-art supervised models by a large margin, highlighting the potential of zero-shot QA for long videos.
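The summarize-then-search idea in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the function names (`summarize_clips`, `search_plot`, `answer_question`), the word-overlap scoring, and the stub `first_sentence` summarizer are all assumptions made for illustration. In the paper a large language model plays the summarizer and answerer roles, and the CLIPCheck visual matching step is omitted here entirely.

```python
import re
from typing import Callable, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def summarize_clips(clips: List[str], summarize: Callable[[str], str]) -> List[str]:
    """Step 1 (summarize): reduce each clip description to one plot line.

    `summarize` is a hypothetical stand-in for a language-model call.
    """
    return [summarize(clip) for clip in clips]


def search_plot(plot: List[str], question: str) -> int:
    """Step 2 (search): return the index of the plot line that shares the
    most words with the question. Word overlap is an assumed, simplistic
    relevance score chosen only to keep the sketch runnable."""
    q_tokens = tokenize(question)
    scores = [len(q_tokens & tokenize(line)) for line in plot]
    return max(range(len(plot)), key=scores.__getitem__)


def answer_question(
    clips: List[str],
    question: str,
    summarize: Callable[[str], str],
    answer: Callable[[str, List[str], str], str],
) -> Tuple[int, str]:
    """Summarize, search, then answer from the plot plus the retrieved clip."""
    plot = summarize_clips(clips, summarize)
    idx = search_plot(plot, question)
    return idx, answer(question, plot, clips[idx])


def first_sentence(text: str) -> str:
    """Stub summarizer (assumption): keep only the first sentence."""
    return text.split(".")[0].strip() + "."


def echo_clip(question: str, plot: List[str], clip: str) -> str:
    """Stub answerer (assumption): return the retrieved clip text as-is."""
    return clip
```

Calling `answer_question(clips, question, first_sentence, echo_clip)` returns the index of the retrieved clip together with the stub answer. Swapping the two stubs for real model calls would keep the same control flow.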

Code Implementations: 1 repo
