CV · Apr 27, 2023

ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System

arXiv:2304.14407v2 · 76 citations · h-index: 61
Originality: Incremental advance
AI Analysis

This work addresses the problem of poor generalization and task-specific constraints in video understanding for real-world deployment, though it appears incremental as it builds on existing Video Foundation Models.

The paper tackles the limitations of existing deep video models by proposing ChatVideo, a tracklet-centric multimodal system that uses Video Foundation Models to annotate tracklet properties and stores them in a database for user interaction, demonstrating effectiveness in answering various video-related problems through case studies on in-the-wild videos.

Existing deep video models are limited by specific tasks, fixed input-output spaces, and poor generalization capabilities, making it difficult to deploy them in real-world scenarios. In this paper, we present our vision for multimodal and versatile video understanding and propose a prototype system, ChatVideo. Our system is built upon a tracklet-centric paradigm, which treats tracklets as the basic video unit and employs various Video Foundation Models (ViFMs) to annotate their properties (e.g., appearance, motion, etc.). All detected tracklets are stored in a database, and the user interacts with them through a database manager. We have conducted extensive case studies on different types of in-the-wild videos, which demonstrate the effectiveness of our method in answering various video-related problems. Our project is available at https://www.wangjunke.info/ChatVideo/
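
The tracklet-centric idea in the abstract can be illustrated with a small sketch: each tracklet becomes one database row holding its annotated properties, and a query layer answers questions over those rows. This is an illustration only, not the authors' implementation. The schema, the `Tracklet` fields, and the helper names `build_database` and `find_tracklets` are assumptions, and the annotations that Video Foundation Models would produce are hard-coded here as plain strings.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Tracklet:
    """One tracked object; all fields are hypothetical, not the paper's schema."""
    tracklet_id: int
    category: str
    appearance: str   # would come from an appearance model in a real system
    motion: str       # would come from a motion/action model in a real system
    start_frame: int
    end_frame: int


def build_database(tracklets):
    """Store tracklets in an in-memory SQLite table, one row per tracklet."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tracklets ("
        "tracklet_id INTEGER PRIMARY KEY, category TEXT, appearance TEXT, "
        "motion TEXT, start_frame INTEGER, end_frame INTEGER)"
    )
    conn.executemany(
        "INSERT INTO tracklets VALUES (?, ?, ?, ?, ?, ?)",
        [
            (t.tracklet_id, t.category, t.appearance, t.motion,
             t.start_frame, t.end_frame)
            for t in tracklets
        ],
    )
    conn.commit()
    return conn


def find_tracklets(conn, category):
    """Return (id, appearance, motion) for every tracklet of one category."""
    return conn.execute(
        "SELECT tracklet_id, appearance, motion FROM tracklets "
        "WHERE category = ? ORDER BY start_frame",
        (category,),
    ).fetchall()


# Made-up example data standing in for detector and ViFM output.
demo = [
    Tracklet(1, "person", "red jacket", "walking left", 0, 120),
    Tracklet(2, "dog", "brown fur", "running", 30, 90),
    Tracklet(3, "person", "blue shirt", "standing", 60, 200),
]
conn = build_database(demo)
print(find_tracklets(conn, "person"))
```

In a system of the kind the abstract describes, the database manager would presumably translate a user's natural-language question into a query like the one in `find_tracklets`; that translation step is left out of this sketch.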

