CV | AI | Mar 11, 2025

Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis

arXiv:2503.08741v3 | 9 citations | h-index: 5 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of expensive and privacy-limited multimodal data collection for MLLM training. It offers an automated synthesis method, an incremental advance aimed at improving data diversity and domain-specific abilities.

The paper tackles the problem of synthesizing multimodal training data for MLLMs without compromising diversity or quality. It proposes Oasis, which generates instruction data from images alone, and reports significant improvements in MLLM performance in experiments on LLaVA-NeXT using over 500k synthesized data points.

The success of multi-modal large language models (MLLMs) has been largely attributed to the large-scale training data. However, the training data of many MLLMs is unavailable due to privacy concerns. The expensive and labor-intensive process of collecting multi-modal data further exacerbates the problem. Is it possible to synthesize multi-modal training data automatically without compromising diversity and quality? In this paper, we propose a new method, Oasis, to synthesize high-quality multi-modal data with only images. Oasis breaks through traditional methods by prompting only images to the MLLMs, thus extending the data diversity by a large margin. Our method features a delicate quality control method which ensures the data quality. We collected over 500k data and conducted incremental experiments on LLaVA-NeXT. Extensive experiments demonstrate that our method can significantly improve the performance of MLLMs. The image-based synthesis also allows us to focus on the specific-domain ability of MLLMs. Code and dataset are publicly available at https://github.com/Letian2003/MM_INF.
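The abstract names two ingredients: prompting an MLLM with images only, and a quality-control step. The sketch below shows how such a pipeline could be wired together. It is not the authors' implementation. `synthesize`, `Sample`, and the three callables (`generate_instruction`, `passes_quality_check`, `generate_response`) are hypothetical names introduced for illustration, and the abstract does not specify the actual quality criteria or model calls, so they are left as pluggable functions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    """One synthesized training example (hypothetical container)."""
    image_path: str
    instruction: str
    response: str


def synthesize(
    image_paths: Iterable[str],
    generate_instruction: Callable[[str], str],
    passes_quality_check: Callable[[str], bool],
    generate_response: Callable[[str, str], str],
) -> List[Sample]:
    """Build instruction data from images alone.

    Assumptions (not taken from the paper):
      - generate_instruction(image_path) queries an MLLM with only the image
        and returns the text it produces as a candidate instruction.
      - passes_quality_check(instruction) stands in for the quality control
        step, whose criteria the abstract does not describe.
      - generate_response(image_path, instruction) answers the instruction.
    """
    samples: List[Sample] = []
    for path in image_paths:
        instruction = generate_instruction(path).strip()
        # Drop empty or low-quality candidates before spending a second
        # model call on a response.
        if not instruction or not passes_quality_check(instruction):
            continue
        response = generate_response(path, instruction)
        samples.append(Sample(path, instruction, response))
    return samples
```

Passing the model calls in as functions keeps the sketch independent of any particular MLLM backend; for the released code and data, see the repository linked in the abstract.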

Code Implementations: 1 repo
