CV · CL · May 10, 2023

VideoChat: Chat-Centric Video Understanding

arXiv:2305.06355v2 · 937 citations · Has Code
AI Analysis

This work proposes a prototype system for interactive video analysis, aimed at researchers and developers working on video understanding. It appears incremental, since it connects existing models through a new interface.

The paper integrates video foundation models and large language models into an end-to-end, chat-centric system called VideoChat, designed for spatiotemporal reasoning, event localization, and causal inference. Preliminary qualitative experiments suggest its potential across a range of video applications.

In this paper, we initiate an attempt of developing an end-to-end chat-centric video understanding system, coined as VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instructively tune this system, we build a video-centric instruction dataset, composed of thousands of videos associated with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, providing a valuable asset for training our chat-centric video understanding system. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications, which could serve as a simple prototype system for future research on chat-centric video understanding. Access our code and data at https://github.com/OpenGVLab/Ask-Anything

Code Implementations: 1 repo
