CV · AI · LG · Jun 22, 2025

ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation

arXiv:2506.18095v1 · 85 citations · h-index: 18 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the lack of open access to advanced image generation capabilities for researchers, though it is incremental as it builds on existing models and datasets.

The authors address the inaccessibility of proprietary multimodal image generation models by releasing ShareGPT-4o-Image, a dataset synthesized with GPT-4o, and Janus-4o, a model that improves on Janus-Pro's text-to-image generation and adds text-and-image-to-image generation. Janus-4o was trained on only 91K synthetic samples in 6 hours on a machine with eight A800 GPUs.

Recent advances in multimodal generative models have unlocked photorealistic, instruction-aligned image generation, yet leading systems like GPT-4o-Image remain proprietary and inaccessible. To democratize these capabilities, we present ShareGPT-4o-Image, the first dataset comprising 45K text-to-image and 46K text-and-image-to-image data, all synthesized using GPT-4o's image generation capabilities for distilling its advanced image generation abilities. Leveraging this dataset, we develop Janus-4o, a multimodal large language model capable of both text-to-image and text-and-image-to-image generation. Janus-4o not only significantly improves text-to-image generation over its predecessor, Janus-Pro, but also newly supports text-and-image-to-image generation. Notably, it achieves impressive performance in text-and-image-to-image generation from scratch, using only 91K synthetic samples and 6 hours of training on an 8 A800-GPU machine. We hope the release of ShareGPT-4o-Image and Janus-4o will foster open research in photorealistic, instruction-aligned image generation.
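The figures quoted in the abstract can be checked with a little arithmetic. The sketch below is illustrative only: the function and constant names are hypothetical, the sample counts are the rounded values from the abstract, and the GPU-hour estimate assumes all eight GPUs were in use for the full six hours, which the abstract does not state explicitly.

```python
# Rounded figures taken from the abstract; names here are illustrative.
T2I_SAMPLES = 45_000    # text-to-image samples
TI2I_SAMPLES = 46_000   # text-and-image-to-image samples
NUM_GPUS = 8            # A800 GPUs in the training machine
TRAIN_HOURS = 6         # reported wall-clock training time


def dataset_size(t2i: int, ti2i: int) -> int:
    """Total number of synthetic samples across both tasks."""
    return t2i + ti2i


def gpu_hours(num_gpus: int, hours: float) -> float:
    """Compute budget, assuming every GPU is busy for the whole run."""
    return num_gpus * hours


total = dataset_size(T2I_SAMPLES, TI2I_SAMPLES)
ti2i_share = TI2I_SAMPLES / total  # fraction of samples that are editing-style

print(f"total samples: {total}")
print(f"text-and-image-to-image share: {ti2i_share:.1%}")
print(f"approx. GPU-hours: {gpu_hours(NUM_GPUS, TRAIN_HOURS)}")
```

Under these assumptions the two subsets add up to the 91K samples cited, split roughly evenly between the two tasks, and the training budget is on the order of 48 A800 GPU-hours.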

Code Implementations: 1 repo