CV · AI · CL · Aug 6, 2024

LLaVA-OneVision: Easy Visual Task Transfer

arXiv:2408.03326v3 · 2542 citations · h-index: 21
AI Analysis

This work addresses the fragmentation of visual AI models by giving researchers and practitioners a single unified model, though it appears incremental because it consolidates insights from the earlier LLaVA-NeXT work.

The paper tackles the challenge of building a single multimodal model that performs well across diverse visual scenarios. The resulting LLaVA-OneVision pushes the performance boundaries of open LMMs in single-image, multi-image, and video tasks, and it shows strong transfer learning, notably from images to videos.

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

Code Implementations: 2 repos
