Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator
This work addresses the need for efficient data generation in multimodal AI by reducing reliance on external models such as GPT-4, though the contribution is incremental because it builds on existing MLLM capabilities.
The paper tackles the problem of generating visual instruction tuning data without relying on GPT-4, proposing Genixer, a pipeline that uses multimodal large language models (MLLMs) as data generators. It shows that training on the resulting synthetic datasets improves performance on multimodal benchmarks: LLaVA1.5 improves on 10 out of 12 benchmarks, and Shikra improves on 7 out of 8 referring expression comprehension (REC) datasets.
Multimodal Large Language Models (MLLMs) demonstrate exceptional problem-solving capabilities, but few studies have examined their ability to generate visual instruction tuning data. This paper explores the potential of empowering MLLMs to generate data independently, without relying on GPT-4. We introduce Genixer, a comprehensive data generation pipeline consisting of four key steps: (i) instruction data collection, (ii) instruction template design, (iii) empowering MLLMs, and (iv) data generation and filtering. We also outline two modes of data generation, task-agnostic and task-specific, which enable controllable output. We demonstrate that LLaVA1.5 trained with a synthetic VQA-like dataset improves on 10 out of 12 multimodal benchmarks. Likewise, the grounding MLLM Shikra, when trained with a REC-like synthetic dataset, improves on 7 out of 8 REC datasets. Through experiments and synthetic data analysis, our findings are: (1) current MLLMs can serve as robust data generators without assistance from GPT-4V; (2) MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data; (3) synthetic datasets enhance performance across various multimodal benchmarks and help mitigate model hallucinations. The data, code, and models can be found at https://github.com/zhaohengyuan1/Genixer.
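As a rough illustration of how the template, generation, and filtering steps and the two generation modes described above might fit together, the following is a minimal sketch and not the Genixer implementation. Every name in it (`Sample`, `build_prompt`, `stub_generator`, `parse_output`, `generate_dataset`) is hypothetical, the template wording is invented for the example, the generator is a stub standing in for a fine-tuned MLLM, and the filter is a simple well-formedness check rather than the filtering used in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Sample:
    image_id: str
    task: str
    question: str
    answer: str


# Hypothetical instruction templates, one per generation mode (assumed wording).
TASK_AGNOSTIC_TEMPLATE = "Generate a question and answer for this image."
TASK_SPECIFIC_TEMPLATE = "Generate a {task} question and answer for this image."


def build_prompt(task: Optional[str] = None) -> str:
    """Task-agnostic prompt when no task is given, task-specific otherwise."""
    if task is None:
        return TASK_AGNOSTIC_TEMPLATE
    return TASK_SPECIFIC_TEMPLATE.format(task=task)


def stub_generator(image_id: str, prompt: str) -> str:
    """Stand-in for a trained MLLM generator; returns canned text only."""
    if "REC" in prompt:
        return "Question: Where is the dog? Answer: [0.10, 0.20, 0.55, 0.80]"
    # Deliberately malformed output (empty answer) to exercise the filter.
    return "Question: What is in the image? Answer:"


def parse_output(raw: str) -> Optional[Tuple[str, str]]:
    """Placeholder filter: keep only outputs with a non-empty question and answer."""
    if "Question:" not in raw or "Answer:" not in raw:
        return None
    q_part, a_part = raw.split("Answer:", 1)
    question = q_part.replace("Question:", "", 1).strip()
    answer = a_part.strip()
    if not question or not answer:
        return None
    return question, answer


def generate_dataset(
    image_ids: List[str],
    generator: Callable[[str, str], str],
    task: Optional[str] = None,
) -> List[Sample]:
    """Generate one candidate per image and keep those that pass the filter."""
    prompt = build_prompt(task)
    kept: List[Sample] = []
    for image_id in image_ids:
        parsed = parse_output(generator(image_id, prompt))
        if parsed is None:
            continue  # filtered out
        question, answer = parsed
        kept.append(Sample(image_id, task or "agnostic", question, answer))
    return kept
```

Calling `generate_dataset(["img1", "img2"], stub_generator, task="REC")` exercises the task-specific mode, while omitting `task` exercises the task-agnostic mode; with this stub, the task-agnostic outputs are all dropped by the filter. In a real setting the stub would be replaced by a call to the trained generator model, and the filter by whatever quality criteria the pipeline actually applies.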