CV · LG · MM · Dec 6, 2024

LinVT: Empower Your Image-level Large Language Model to Understand Videos

arXiv:2412.05185v2 · 23 citations · h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses the challenge of video understanding for AI assistants by enabling efficient adaptation of existing image-LLMs, though it is incremental as it builds on prior visual LLM methods.

The authors adapt image-based large language models (LLMs) to video understanding with LinVT, a plug-and-play linear video tokenizer, and report state-of-the-art performance on various video benchmarks.

Large Language Models (LLMs) have been widely used in various tasks, motivating us to develop an LLM-based assistant for videos. Instead of training from scratch, we propose a module to transform arbitrary well-trained image-based LLMs into video-LLMs (after being trained on video data). To better adapt image-LLMs for processing videos, we introduce two design principles: linear transformation to preserve the original visual-language alignment and representative information condensation from redundant video content. Guided by these principles, we propose a plug-and-play Linear Video Tokenizer (LinVT), which enables existing image-LLMs to understand videos. We benchmark LinVT with six recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL, showcasing the high compatibility of LinVT. LinVT-based LLMs achieve state-of-the-art performance across various video benchmarks, illustrating the effectiveness of LinVT in multi-modal video understanding.
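
The two design principles in the abstract can be illustrated with a toy example. This is a minimal sketch, not the LinVT architecture: it assumes that "linear" means every output video token is a weighted sum of the input frame tokens, and that "condensation" means producing fewer tokens than were given. The helper names `uniform_pool_weights` and `condense`, and the chunk-averaging weights, are hypothetical and chosen only for illustration.

```python
from typing import List

Matrix = List[List[float]]


def uniform_pool_weights(n_in: int, n_out: int) -> Matrix:
    """Hypothetical weights: each output token averages one contiguous chunk of inputs."""
    if n_in % n_out != 0:
        raise ValueError("n_in must be divisible by n_out in this sketch")
    chunk = n_in // n_out
    weights = [[0.0] * n_in for _ in range(n_out)]
    for row in range(n_out):
        for col in range(row * chunk, (row + 1) * chunk):
            weights[row][col] = 1.0 / chunk
    return weights


def condense(tokens: Matrix, weights: Matrix) -> Matrix:
    """Compute weights @ tokens, so each output token is a linear combination of inputs."""
    dim = len(tokens[0])
    return [
        [sum(w * tok[d] for w, tok in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]


# Four frame tokens of dimension 2, condensed into two video tokens.
frame_tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 6.0]]
video_tokens = condense(frame_tokens, uniform_pool_weights(4, 2))
print(video_tokens)  # [[2.0, 1.0], [1.0, 5.0]]
```

Because the outputs are weighted averages of the inputs, they stay in the same embedding space as the original image tokens, which is one plausible reading of how a linear tokenizer could preserve the existing visual-language alignment. A trained module would presumably learn its weights rather than use fixed averaging.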

Code Implementations: 1 repo
