CV · Jun 17, 2024

Generative Visual Instruction Tuning

Jefferson Hernandez, Ruben Villegas, Vicente Ordonez

arXiv:2406.11262v2 · 7.64 citations · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for advanced general-purpose visual assistants by enabling better multimodal task performance, though it is incremental as it builds on existing models and datasets.

The authors aim to improve the zero-shot capabilities of large multimodal models on image generation and image editing tasks by automatically generating instruction-following data. The resulting model, GenLLaVA, shows visual understanding superior to LLaVA and results competitive with native multimodal models such as Unified-IO 2.

We propose to use automatically generated instruction-following data to improve the zero-shot capabilities of a large multimodal model with additional support for generative and image editing tasks. We achieve this by curating a new multimodal instruction-following set using GPT-4V and existing datasets for image generation and editing. Using this instruction set and the existing LLaVA-Finetune instruction set for visual understanding tasks, we produce GenLLaVA, a Generative Large Language and Visual Assistant. GenLLaVA is built through a strategy that combines three types of large pretrained models through instruction finetuning: Mistral for language modeling, SigLIP for image-text matching, and StableDiffusion for text-to-image generation. Our model demonstrates visual understanding capabilities superior to LLaVA and additionally demonstrates competitive results with native multimodal models such as Unified-IO 2, paving the way for building advanced general-purpose visual assistants by effectively re-using existing multimodal models. We open-source our dataset, codebase, and model checkpoints to foster further research and application in this domain.
