CV | Jul 31, 2023

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang

arXiv:2307.16449v4 | 48.0 | 600 citations | h-index: 60 | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the problem of handling long videos in AI systems. The advance is incremental because it builds on existing integrations of video models and language models.

The paper tackles the challenge of long video understanding by proposing MovieChat, a system that integrates video foundation models with large language models using a memory mechanism inspired by the Atkinson-Shiffrin model, achieving state-of-the-art performance and releasing a benchmark with 1K videos and 14K annotations.

Integrating video foundation models with large language models to build a video understanding system can overcome the limitations of specific pre-defined vision tasks. Yet existing systems can only handle videos with very few frames. For long videos, computational complexity, memory cost, and long-term temporal connections pose additional challenges. Drawing on the Atkinson-Shiffrin memory model, we use Transformer tokens as the carriers of memory, combined with a specially designed memory mechanism, and propose MovieChat to overcome these challenges. MovieChat achieves state-of-the-art performance in long video understanding. We also release the MovieChat-1K benchmark, with 1K long videos and 14K manual annotations, to validate the effectiveness of our method.
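The abstract does not spell out the memory mechanism, so the following is only a minimal sketch of the general "dense token to sparse memory" idea: a bounded short-term buffer of per-frame features is periodically compressed into a long-term store. The class name `SparseMemory`, the parameters `short_capacity` and `merged_len`, and the choice to merge the most similar adjacent frames by averaging are all assumptions made for illustration, not the authors' confirmed implementation.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def consolidate(frames, target_len):
    """Average the most similar adjacent pair until target_len frames remain.

    Assumption: adjacent-pair averaging stands in for whatever merging rule
    the actual system uses.
    """
    frames = [list(f) for f in frames]
    while len(frames) > max(target_len, 1):
        sims = [cosine(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(x + y) / 2 for x, y in zip(frames[i], frames[i + 1])]
        frames[i:i + 2] = [merged]
    return frames


class SparseMemory:
    """Hypothetical two-stage memory: short-term buffer feeding a long-term store."""

    def __init__(self, short_capacity=4, merged_len=2):
        self.short_capacity = short_capacity  # frames held before consolidation
        self.merged_len = merged_len          # frames kept per consolidation
        self.short_term = []
        self.long_term = []

    def add(self, frame_feature):
        """Buffer one frame feature; compress the buffer once it is full."""
        self.short_term.append(list(frame_feature))
        if len(self.short_term) >= self.short_capacity:
            self.long_term.extend(consolidate(self.short_term, self.merged_len))
            self.short_term = []


# Toy usage: four 2-D "frame features" forming two visually similar pairs.
memory = SparseMemory(short_capacity=4, merged_len=2)
for feature in [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]:
    memory.add(feature)
```

In this toy run the four dense frames collapse into two long-term entries, one per similar pair, which is the kind of compression that keeps memory cost bounded as the video grows.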
