CV | AI | Jan 9, 2025

LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding

arXiv:2501.05067v29 citations | h-index: 6
AI Analysis

This work addresses video understanding tasks for AI and computer vision researchers, presenting an incremental improvement through adaptive fusion of existing projectors.

The paper introduces LLaVA-Octopus, a video multimodal large language model that adaptively weights features from different visual projectors based on user instructions. It reports strong results on benchmarks covering video question answering and long video understanding.

In this paper, we introduce LLaVA-Octopus, a novel video multimodal large language model. LLaVA-Octopus adaptively weights features from different visual projectors based on user instructions, enabling us to leverage the complementary strengths of each projector. We observe that different visual projectors exhibit distinct characteristics when handling specific tasks. For instance, some projectors excel at capturing static details, others are more effective at processing temporal information, and some are better suited for tasks requiring temporal coherence. By adjusting feature weights according to user instructions, LLaVA-Octopus selects and combines the most suitable features, significantly enhancing the model's performance in multimodal tasks. Experimental results demonstrate that LLaVA-Octopus achieves excellent performance across multiple benchmarks, especially in video question answering, long video understanding, and comprehensive multiple-choice benchmarks, highlighting its broad application potential.
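The abstract does not spell out how the instruction-dependent weights are computed, so the following is only a minimal sketch of one plausible reading: a hypothetical linear router maps an instruction embedding to one logit per projector, a softmax turns the logits into weights, and the projector outputs are combined by a weighted sum. The names `router_weights`, `fuse`, and `router_matrix`, the softmax gating, and the three toy projectors are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def router_weights(instruction_embedding, router_matrix):
    """Hypothetical router: one logit per projector (row . embedding), then softmax."""
    logits = [
        sum(w * x for w, x in zip(row, instruction_embedding))
        for row in router_matrix
    ]
    return softmax(logits)


def fuse(projector_features, weights):
    """Weighted sum of per-projector features, each shaped [tokens][dim]."""
    num_tokens = len(projector_features[0])
    dim = len(projector_features[0][0])
    fused = [[0.0] * dim for _ in range(num_tokens)]
    for feats, weight in zip(projector_features, weights):
        for t in range(num_tokens):
            for d in range(dim):
                fused[t][d] += weight * feats[t][d]
    return fused


# Toy data (assumed): three projectors, one visual token, two feature dimensions.
instruction_embedding = [1.0, 0.0]
router_matrix = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
projector_features = [
    [[1.0, 1.0]],  # e.g. a projector tuned for static detail
    [[2.0, 2.0]],  # e.g. a projector tuned for temporal information
    [[3.0, 3.0]],  # e.g. a projector tuned for temporal coherence
]

weights = router_weights(instruction_embedding, router_matrix)
fused = fuse(projector_features, weights)
```

In this toy setup the instruction embedding aligns with the first router row, so the first projector receives the largest weight and the fused feature is pulled toward its output. A different instruction embedding would shift the weights without changing the projectors themselves.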
