CV · AI · CL · LG · MM · Nov 9, 2023

LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Microsoft
arXiv:2311.05437v1 · 232 citations · h-index: 40
Originality: Incremental advance
AI Analysis

LLaVA-Plus addresses the need for more capable multimodal AI agents in real-world applications, though the advance appears incremental because it builds on existing models.

The paper enhances large multimodal models with LLaVA-Plus, a general-purpose multimodal assistant that draws on a skill repository of pre-trained models and activates tools for tasks such as visual understanding and generation. The result is improved performance over LLaVA and new capabilities.

LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models. It maintains a skill repository of pre-trained vision and vision-language models and can activate relevant tools based on users' inputs to fulfill real-world tasks. LLaVA-Plus is trained on multimodal instruction-following data to acquire the ability to use tools, covering visual understanding, generation, external knowledge retrieval, and compositions of these. Empirical results show that LLaVA-Plus outperforms LLaVA in existing capabilities and exhibits new ones. It is distinct in that the image query is directly grounded and actively engaged throughout the entire human-AI interaction session, which significantly improves tool-use performance and enables new scenarios.
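The skill-repository idea described above can be illustrated with a small sketch. This is not the LLaVA-Plus implementation: the names `Skill`, `SkillRepository`, `plan_tool`, and `answer` are hypothetical, the tools are stub functions, and a keyword rule stands in for the trained multimodal model that would choose a tool in the real system. It only shows the general shape of "register tools, pick one from the user's input, keep the image attached to the call".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Skill:
    """One entry in a hypothetical skill repository."""
    name: str
    description: str
    run: Callable[[str, str], str]  # (image_ref, instruction) -> tool output


class SkillRepository:
    """Holds the available tools, keyed by name."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def get(self, name: str) -> Skill:
        return self._skills[name]

    def names(self) -> List[str]:
        return sorted(self._skills)


def plan_tool(instruction: str) -> Optional[str]:
    """Stand-in planner: keyword rules, NOT a trained multimodal model."""
    text = instruction.lower()
    if "detect" in text or "find" in text:
        return "detection"
    if "caption" in text or "describe" in text:
        return "captioning"
    return None  # no tool needed; answer directly


def answer(repo: SkillRepository, image_ref: str, instruction: str) -> str:
    """Route one user turn; the image reference is passed to every tool call."""
    tool = plan_tool(instruction)
    if tool is None:
        return f"[direct answer about {image_ref}]"
    output = repo.get(tool).run(image_ref, instruction)
    return f"[{tool}] {output}"


# Stub tools for illustration only.
repo = SkillRepository()
repo.register(Skill("detection", "locate objects", lambda img, q: f"boxes for {img}"))
repo.register(Skill("captioning", "describe image", lambda img, q: f"caption for {img}"))

print(answer(repo, "cat.png", "Detect the cat"))
```

In this sketch the image reference travels with every call, which loosely mirrors the abstract's point that the image query stays engaged throughout the interaction; how the actual system achieves this is not specified on this page.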

Code Implementations: 1 repo
