CV, Jan 26, 2025

TinyLLaVA-Video: Towards Smaller LMMs for Video Understanding with Group Resampler

arXiv:2501.15513v2, 3 citations, h-index: 4, Has Code
Originality: Incremental advance
AI Analysis

This work addresses the high computational cost that puts video understanding models out of reach for many researchers, though its efficiency improvement is an incremental advance.

The authors tackled the problem of resource-intensive large multimodal models for video understanding by introducing TinyLLaVA-Video, a lightweight model with approximately 3.6B parameters that surpasses several existing 7B-parameter models on multiple benchmarks.

Video behavior recognition and scene understanding are fundamental tasks in multimodal intelligence, serving as critical building blocks for numerous real-world applications. Though large multimodal models (LMMs) have achieved remarkable progress in video understanding, most existing open-source models rely on over 7B parameters and require large-scale datasets for training, making them resource-intensive and inaccessible to many researchers. Furthermore, lightweight models face persistent challenges in effectively processing long visual sequences and temporal understanding. In this work, we introduce TinyLLaVA-Video, a lightweight yet powerful video understanding model with approximately 3.6B parameters. The cornerstone of our design is the video-level group resampler, a novel mechanism that significantly reduces and controls the number of visual tokens at the video level. Unlike traditional image-level resamplers, our approach effectively mitigates redundancy while enhancing temporal comprehension, leading to improved performance on video-based tasks. In addition, TinyLLaVA-Video demonstrates exceptional efficiency, requiring only one day of training on 8 A100-40G GPUs. It surpasses several existing 7B-parameter models on multiple benchmarks. We believe this work provides a valuable foundation for future research on lightweight video understanding models. The code and weights are available at https://github.com/ZhangXJ199/TinyLLaVA-Video.
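The abstract's contrast between image-level and video-level resampling can be illustrated with a small token-budget sketch. It assumes an image-level resampler compresses each frame independently (so the token count grows with the number of frames), while a video-level resampler uses one fixed set of queries over all frames. The function names and every numeric value here (frame counts, 64 queries per frame, 512 total queries) are hypothetical and not taken from the paper. The sketch does not model the grouping mechanism itself, which the abstract does not describe.

```python
# Illustrative token-budget comparison. All names and numbers below are
# assumptions for illustration, not values reported by the paper.


def image_level_token_count(num_frames: int, queries_per_frame: int) -> int:
    """Tokens passed to the language model when each frame is resampled
    independently: the total grows linearly with the number of frames."""
    return num_frames * queries_per_frame


def video_level_token_count(num_frames: int, total_queries: int) -> int:
    """Tokens passed to the language model when one fixed set of queries
    resamples the whole video: the total does not depend on num_frames."""
    return total_queries


if __name__ == "__main__":
    QUERIES_PER_FRAME = 64   # hypothetical per-frame budget
    TOTAL_QUERIES = 512      # hypothetical video-level budget
    for frames in (8, 16, 32, 64):
        per_image = image_level_token_count(frames, QUERIES_PER_FRAME)
        per_video = video_level_token_count(frames, TOTAL_QUERIES)
        print(f"{frames:>2} frames: image-level={per_image:>4}, video-level={per_video}")
```

Under these assumed numbers, sampling more frames leaves the video-level budget unchanged, which is one way to read the abstract's claim that the resampler "controls" the number of visual tokens.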

Code Implementations: 1 repo
