Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
This work addresses the challenge of efficient long video understanding for AI applications, though it appears incremental because it builds on existing token compression techniques.
The paper tackles hour-long video understanding in multimodal large language models by proposing Video-XL-Pro, which uses a learnable Reconstructive Compression of Tokens module to generate compact video tokens. With only 3B parameters, the model outperforms most 7B models and can process over 8K frames on a single A100 GPU.
Despite advanced token compression techniques, existing multimodal large language models (MLLMs) still struggle with hour-long video understanding. In this work, we propose Video-XL-Pro, an efficient method for extremely long video understanding, built upon Reconstructive Compression of Tokens (ReCoT), a learnable module that leverages self-supervised learning to generate comprehensive and compact video tokens. ReCoT introduces two key components: (i) Dynamic Token Synthesizer (DTS): DTS generates pseudo-video tokens from static image tokens by learning intra-token relationships; these pseudo-video tokens are then used in masked video modeling. (ii) Semantic-Guided Masking (SGM): SGM adaptively masks redundant visual tokens to facilitate more effective reconstructive learning. To improve training efficiency in MLLM fine-tuning, we introduce a video-specific dataset pruning strategy and design a simple Query-aware Selector that enables the model to precisely locate query-relevant video tokens. With only 3B parameters, Video-XL-Pro outperforms most 7B models trained on larger datasets across multiple long video understanding benchmarks. Moreover, it can process over 8K frames on a single A100 GPU while maintaining high-quality performance.
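
The abstract does not say how the Query-aware Selector scores tokens, so the following is only an illustrative sketch of the general idea of keeping query-relevant video tokens. It assumes that each video token and the query are already embedded as plain vectors, that relevance is measured by cosine similarity, and that a fixed budget of k tokens is kept. The function names `cosine` and `select_query_relevant` are hypothetical and do not come from the paper.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_query_relevant(
    tokens: Sequence[Sequence[float]],
    query: Sequence[float],
    k: int,
) -> List[int]:
    """Return indices of the k tokens most similar to the query.

    Assumption: cosine-similarity top-k stands in for whatever scoring
    the actual selector uses. Indices are returned in their original
    (temporal) order so the kept tokens remain a valid sequence.
    """
    scores = [cosine(t, query) for t in tokens]
    # Rank token indices by descending relevance and keep the top k.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[: max(k, 0)])


# Toy example: four 2-D "video tokens" and one query vector.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
toy_query = [1.0, 0.0]
kept = select_query_relevant(toy_tokens, toy_query, k=2)
```

In this toy case the first and third tokens point in nearly the same direction as the query, so they are the ones retained. Returning the indices in temporal order is a design choice made for this sketch, on the assumption that a downstream language model expects tokens in frame order.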