CV, AI, CL | Aug 6, 2024

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li

arXiv:2408.03326v367.82759 citations | h-index: 21 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the fragmentation of visual AI models that researchers and practitioners face by providing a single unified model. It appears incremental, however, since it builds on insights from the earlier LLaVA-NeXT work.

The paper tackles the challenge of building a single multimodal model that performs well across diverse visual scenarios. The resulting LLaVA-OneVision reports state-of-the-art performance among open LMMs on single-image, multi-image, and video tasks, and demonstrates strong transfer learning across these scenarios.

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
