CV · Jun 17, 2024

Generative Visual Instruction Tuning

arXiv:2406.11262v2 · 4 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for advanced general-purpose visual assistants by improving multimodal task performance. The advance is incremental because it builds on existing models and datasets.

The authors improve the zero-shot capabilities of a large multimodal model on generative and image editing tasks by automatically generating instruction-following data. The resulting model, GenLLaVA, shows visual understanding superior to LLaVA and results competitive with models such as Unified-IO 2.

We propose to use automatically generated instruction-following data to improve the zero-shot capabilities of a large multimodal model with additional support for generative and image editing tasks. We achieve this by curating a new multimodal instruction-following set using GPT-4V and existing datasets for image generation and editing. Using this instruction set and the existing LLaVA-Finetune instruction set for visual understanding tasks, we produce GenLLaVA, a Generative Large Language and Visual Assistant. GenLLaVA is built through a strategy that combines three types of large pretrained models through instruction finetuning: Mistral for language modeling, SigLIP for image-text matching, and StableDiffusion for text-to-image generation. Our model demonstrates visual understanding capabilities superior to LLaVA and additionally demonstrates competitive results with native multimodal models such as Unified-IO 2, paving the way for building advanced general-purpose visual assistants by effectively re-using existing multimodal models. We open-source our dataset, codebase, and model checkpoints to foster further research and application in this domain.
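
The abstract describes training on two instruction sets together: the new set for generation and editing, and the existing LLaVA-Finetune set for visual understanding. A minimal sketch of that mixing step is shown below. The record schema (`InstructionSample`), the task tags, the helper names, and the toy samples are all hypothetical and chosen for illustration; the paper's actual data format and mixing ratios are not stated here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionSample:
    """Hypothetical record; the real dataset schema may differ."""
    instruction: str
    response: str
    task: str  # assumed tags: "understanding", "generation", "editing"


def mix_instruction_sets(understanding, generative, seed=0):
    """Concatenate two instruction sets and shuffle them reproducibly."""
    mixed = list(understanding) + list(generative)
    random.Random(seed).shuffle(mixed)
    return mixed


def task_counts(samples):
    """Count how many samples carry each task tag."""
    counts = {}
    for sample in samples:
        counts[sample.task] = counts.get(sample.task, 0) + 1
    return counts


# Toy stand-ins for the two sources (illustrative only).
understanding_set = [
    InstructionSample("What is shown in the image?", "A dog on a beach.", "understanding"),
    InstructionSample("How many people are visible?", "Three.", "understanding"),
]
generative_set = [
    InstructionSample("Draw a red bicycle.", "<image>", "generation"),
    InstructionSample("Make the sky darker.", "<image>", "editing"),
]

mixed = mix_instruction_sets(understanding_set, generative_set, seed=0)
counts = task_counts(mixed)
```

Shuffling with a fixed seed keeps the mixture reproducible across runs, and the per-task counts give a quick check that no source was dropped during mixing.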

Code Implementations: 1 repo