CV CL May 10, 2023

VideoChat: Chat-Centric Video Understanding

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, Yu Qiao

arXiv:2305.06355v252.7954 citations Has Code

Originality: Incremental advance

AI Analysis

This work targets researchers and developers working on video understanding. It proposes a prototype system for interactive video analysis, though the contribution appears incremental because it connects existing models through a new interface.

The paper integrates video foundation models and large language models through a learnable neural interface to form an end-to-end chat-centric system called VideoChat, aimed at spatiotemporal reasoning, event localization, and causal relationship inference. The evidence is limited to preliminary qualitative experiments, which suggest potential across a range of video applications.

In this paper, we make an initial attempt at developing an end-to-end chat-centric video understanding system, named VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we build a video-centric instruction dataset composed of thousands of videos paired with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, providing a valuable asset for training our chat-centric video understanding system. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications, and it could serve as a simple prototype for future research on chat-centric video understanding. Code and data are available at https://github.com/OpenGVLab/Ask-Anything
