CV · Feb 7, 2025

Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

Yunhang Shen, Chaoyou Fu, Shaoqi Dong, Xiong Wang, Yi-Fan Zhang, Peixian Chen, Mengdan Zhang, Haoyu Cao, Ke Li, Shaohui Lin, Xiawu Zheng, Yan Zhang

arXiv:2502.05177v329.535 citations · h-index: 25 · Has Code

Originality: Highly original

AI Analysis

This work addresses the problem of long-context multi-modal understanding for the open-source community, providing a competitive baseline for advancing this area of research.

Long-VITA scales large multi-modal models to 1 million tokens and reports state-of-the-art performance on various multi-modal benchmarks, with a 2x prefill speedup and 4x context length extension at inference. It also performs strongly on short-context multi-modal tasks while processing and analyzing image, video, and text inputs over 4K frames or 1M tokens.

We introduce Long-VITA, a simple yet effective large multi-modal model for long-context visual-language understanding tasks. It is adept at concurrently processing and analyzing image, video, and text modalities over 4K frames or 1M tokens while delivering strong performance on short-context multi-modal tasks. We propose an effective multi-modal training schema that starts with large language models and proceeds through vision-language alignment, general knowledge learning, and two sequential stages of long-sequence fine-tuning. We further implement context-parallelism distributed inference and a logits-masked language modeling head to scale Long-VITA to infinitely long inputs of images and texts during model inference. Regarding training data, Long-VITA is built on a mix of 17M samples from public datasets only and demonstrates state-of-the-art performance on various multi-modal benchmarks, compared against recent cutting-edge models trained with internal data. Long-VITA is fully open-source and reproducible. By leveraging our inference designs, Long-VITA models achieve a remarkable 2x prefill speedup and 4x context length extension in a single node with 8 GPUs. We hope Long-VITA can serve as a competitive baseline and offer valuable insights for the open-source community in advancing long-context multi-modal understanding.
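The abstract does not spell out how the logits-masked language modeling head works. One plausible reading is that during prefill only the final position's hidden state is projected to vocabulary logits, since only that position is needed to pick the next token. The sketch below illustrates that idea in plain Python under this assumption. The function names, the toy shapes, and the 150,000-entry vocabulary are hypothetical and are not taken from the Long-VITA code.

```python
# Illustrative sketch only: names, shapes, and sizes are assumptions,
# not Long-VITA's actual implementation.

def matvec(weight, hidden):
    """Project one hidden vector (length d) to logits using weight of shape (vocab, d)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]

def full_lm_head(hidden_states, weight):
    """Baseline head: logits for every position (seq_len x vocab values)."""
    return [matvec(weight, h) for h in hidden_states]

def masked_lm_head(hidden_states, weight, keep=1):
    """Masked head: project only the last `keep` positions and skip the rest."""
    return [matvec(weight, h) for h in hidden_states[-keep:]]

def logits_bytes(num_positions, vocab_size, bytes_per_value=2):
    """Memory needed to hold the logits, assuming 2-byte (16-bit) values."""
    return num_positions * vocab_size * bytes_per_value

# Toy input: 4 positions, hidden size 3, vocabulary of 5 entries.
hidden_states = [[0.1 * (t + 1), -0.2 * t, 0.05] for t in range(4)]
weight = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]

full = full_lm_head(hidden_states, weight)
last = masked_lm_head(hidden_states, weight)

# The masked head yields the same next-token logits as the last row of the full head.
print(last[0] == full[-1])

# Hypothetical scale: 1M positions and a 150,000-entry vocabulary at 2 bytes per value.
print(logits_bytes(1_000_000, 150_000))  # all positions
print(logits_bytes(1, 150_000))          # last position only
```

Under these assumed sizes, logits for every position would occupy about 300 GB, while the last position alone needs about 300 KB. This is why skipping the unused positions matters once inputs reach the million-token range.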

