CV | AI | MM | Aug 28, 2024

Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input

arXiv:2408.15542v1 | 135 citations | h-index: 7
Originality: Incremental advance
AI Analysis

This paper addresses the problem of processing long videos effectively and represents an incremental advance in video-language models.

The paper tackles the challenge of extending large language models to long videos by introducing Kangaroo, an 8B-parameter video-language model. Kangaroo achieves state-of-the-art performance on a variety of video understanding benchmarks and, on benchmarks specialized for long videos, outperforms some larger models.

Rapid advancements have been made in extending Large Language Models (LLMs) to Large Multi-modal Models (LMMs). However, extending the input modality of LLMs to video data remains a challenging endeavor, especially for long videos. Due to insufficient access to large-scale high-quality video data and the excessive compression of visual features, current methods exhibit limitations in effectively processing long videos. In this paper, we introduce Kangaroo, a powerful Video LMM aimed at addressing these challenges. Confronted with the issue of inadequate training data, we develop a data curation system to build a large-scale dataset with high-quality annotations for vision-language pre-training and instruction tuning. In addition, we design a curriculum training pipeline with gradually increasing resolution and number of input frames to accommodate long videos. Evaluation results demonstrate that, with 8B parameters, Kangaroo achieves state-of-the-art performance across a variety of video understanding benchmarks while exhibiting competitive results on others. In particular, on benchmarks specialized for long videos, Kangaroo outperforms some larger models with over 10B parameters and proprietary models.
