CV | Jul 13, 2023

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

arXiv:2307.06942v2 | 490 citations | h-index: 71
Originality: Incremental advance
AI Analysis

InternVid provides a scalable tool for researchers and practitioners in multimodal video understanding and generation, though it is an incremental advance that builds on existing dataset and model approaches.

The paper tackles the problem of limited video-text data for multimodal AI by introducing InternVid, a large-scale dataset with over 7 million videos and 4.1 billion words, and shows that training a model on it achieves leading zero-shot action recognition and competitive video retrieval performance.

This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words. Our core contribution is to develop a scalable approach to autonomously build a high-quality video-text dataset with large language models (LLMs), thereby showcasing its efficacy in learning video-language representation at scale. Specifically, we utilize a multi-scale approach to generate video-related descriptions. Furthermore, we introduce ViCLIP, a video-text representation learning model based on ViT-L. Learned on InternVid via contrastive learning, this model demonstrates leading zero-shot action recognition and competitive video retrieval performance. Beyond basic video understanding tasks like recognition and retrieval, our dataset and model have broad applications. They are particularly beneficial for generating interleaved video-text data for learning a video-centric dialogue system, advancing video-to-text and text-to-video generation research. These proposed resources provide a tool for researchers and practitioners interested in multimodal video understanding and generation.
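To put the headline figures in the abstract into per-clip terms, the short sketch below derives a few averages from them. It is only back-of-the-envelope arithmetic: the inputs are the rounded numbers quoted above ("over 7 million", "nearly 760K hours"), the function name `dataset_averages` is hypothetical, and the seconds-per-clip figure assumes the clips together cover the full video duration without overlap, which the abstract does not state.

```python
# Headline figures as quoted in the abstract (all rounded, so results are approximate).
TOTAL_HOURS = 760_000          # "nearly 760K hours"
NUM_VIDEOS = 7_000_000         # "over 7 million videos"
NUM_CLIPS = 234_000_000        # "234M video clips"
TOTAL_WORDS = 4_100_000_000    # "4.1B words"


def dataset_averages(total_hours, num_videos, num_clips, total_words):
    """Derive rough per-clip and per-video averages from dataset totals.

    Assumption: clips tile the full video duration with no overlap or gaps,
    so total duration divided by clip count approximates mean clip length.
    """
    total_seconds = total_hours * 3600
    return {
        "seconds_per_clip": total_seconds / num_clips,
        "words_per_clip": total_words / num_clips,
        "clips_per_video": num_clips / num_videos,
        "minutes_per_video": total_hours * 60 / num_videos,
    }


averages = dataset_averages(TOTAL_HOURS, NUM_VIDEOS, NUM_CLIPS, TOTAL_WORDS)
for name, value in averages.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions the totals work out to about 11.7 seconds and 17.5 words per clip, which suggests the descriptions are short, sentence-length captions attached to brief clips rather than long transcripts.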

Code Implementations: 1 repo
