CV · MM · Jul 31, 2024

Learning Video Context as Interleaved Multimodal Sequences

arXiv:2407.21757v2 · 15 citations · h-index: 18 · Has Code
AI Analysis

This paper addresses video understanding for narrative content such as movies, but the contribution is incremental: it builds on existing multimodal and instruction-tuning approaches.

The paper tackles the challenge of understanding narrative videos by introducing MovieSeq, a multimodal language model that represents videos as interleaved sequences of images, plots, videos, and subtitles. The model is validated on six datasets across five settings.

Narrative videos, such as movies, pose significant challenges in video understanding due to their rich contexts (characters, dialogues, storylines) and diverse demands (identifying who is involved, how characters relate, and why events happen). In this paper, we introduce MovieSeq, a multimodal language model developed to address the wide range of challenges in understanding video contexts. Our core idea is to represent videos as interleaved multimodal sequences (including images, plots, videos, and subtitles), either by linking external knowledge databases or by using offline models (such as Whisper for subtitles). Through instruction-tuning, this approach empowers the language model to interact with videos using interleaved multimodal instructions. For example, instead of relying solely on video as input, we jointly provide character photos alongside their names and dialogues, allowing the model to associate these elements and generate more comprehensive responses. To demonstrate its effectiveness, we validate MovieSeq's performance on six datasets (LVU, MAD, MovieNet, CMD, TVC, MovieQA) across five settings (video classification, audio description, video-text retrieval, video captioning, and video question-answering). The code will be made public at https://github.com/showlab/MovieSeq.
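The interleaved-sequence idea in the abstract can be illustrated with a small data structure. The following is a minimal sketch, not MovieSeq's actual code: the class names (`Text`, `Image`, `Video`), the helper functions, and the `<image>` / `<video>` placeholder tokens are all hypothetical, chosen only to show how character photos, names, a clip, and subtitles might be ordered into one sequence before being passed to a multimodal language model.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


# Hypothetical segment types; none of these names come from the paper.
@dataclass
class Text:
    content: str


@dataclass
class Image:
    path: str


@dataclass
class Video:
    path: str


Segment = Union[Text, Image, Video]


def build_sequence(
    characters: List[Tuple[str, str]],  # (name, photo path)
    clip_path: str,
    subtitles: List[Tuple[str, str]],   # (speaker, line)
    question: str,
) -> List[Segment]:
    """Interleave character photos, names, the clip, subtitles, and a question."""
    seq: List[Segment] = []
    for name, photo in characters:
        # Place each photo next to its name so the two can be associated.
        seq.append(Image(photo))
        seq.append(Text(f"This is {name}."))
    seq.append(Video(clip_path))
    for speaker, line in subtitles:
        seq.append(Text(f"{speaker}: {line}"))
    seq.append(Text(question))
    return seq


def to_prompt(seq: List[Segment]) -> Tuple[str, List[str]]:
    """Flatten the sequence into a text prompt with media placeholders.

    Returns the prompt string and the media paths in placeholder order.
    The placeholder tokens are an assumption made for illustration.
    """
    parts: List[str] = []
    media: List[str] = []
    for seg in seq:
        if isinstance(seg, Text):
            parts.append(seg.content)
        elif isinstance(seg, Image):
            parts.append("<image>")
            media.append(seg.path)
        else:
            parts.append("<video>")
            media.append(seg.path)
    return " ".join(parts), media
```

Calling `build_sequence([("Alice", "alice.jpg")], "clip.mp4", [("Alice", "Hello.")], "Who is speaking?")` and passing the result to `to_prompt` yields a prompt in which the photo placeholder sits directly before the character's name, followed by the video placeholder, the subtitle line, and the question. In a real system, subtitles might come from an offline speech model such as Whisper, as the abstract suggests; that step is omitted here.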

Code Implementations: 1 repo