CV AI May 28

GPIC: A Giant Permissive Image Corpus for Visual Generation

Keshigeyan Chandrasegaran, Kyle Sargent, Suchir Agarwal, Michael Jang, Michael Poli, Juan Carlos Niebles, Justin Johnson, Jiajun Wu, Li Fei-Fei

arXiv:2605.30341 79.9 Has Code

AI Analysis

GPIC addresses the need for large, accessible, and permissively licensed image corpora for training and evaluating visual generative models, which particularly benefits researchers who need licenses that allow commercial use.

GPIC is a large-scale, permissively licensed image dataset of ~28 trillion pixels with 100M training, 200K validation, and 1M test examples, intended to support research on scalable visual generative modeling. The dataset is safety-filtered, deduplicated, and hosted on Hugging Face, and it comes with a benchmarking protocol and a pixel-space flow matching baseline.
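To make the flow matching baseline mentioned above more concrete, here is a minimal, dependency-free sketch of a flow matching training loss applied directly to pixel values. It assumes one common convention (a straight-line path from Gaussian noise at t=0 to data at t=1, with the path's constant velocity as the regression target); the listing does not state which formulation the authors use, so this is an illustration and not the GPIC baseline. The names `interpolate`, `target_velocity`, `flow_matching_loss`, and `zero_model` are hypothetical.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x1, rng):
    """One-sample Monte Carlo estimate of the loss for a single image x1."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise, same size as x1
    t = rng.random()                         # time drawn uniformly from [0, 1)
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    # Mean squared error between predicted and target velocity
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)

def zero_model(x_t, t):
    """Hypothetical stand-in for a neural network: always predicts zero velocity."""
    return [0.0] * len(x_t)

rng = random.Random(0)
pixels = [0.2, 0.5, 0.9, 0.1]  # a flattened toy "image" (assumed values in [0, 1])
loss = flow_matching_loss(zero_model, pixels, rng)
print(f"loss for the zero-velocity model: {loss:.4f}")
```

In practice `predict_velocity` would be a trained network and the loss would be averaged over batches of images; the toy model here only shows the shape of the objective.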

Studying scalable methods for visual generative modeling requires large, accessible, and stable datasets. We introduce GPIC, a Giant Permissive Image Corpus of approximately 28 trillion pixels. GPIC comprises diverse internet images captioned by a state-of-the-art vision-language model, including 100M training, 200K validation, and 1M test examples. Moreover, all GPIC images are permissively licensed for both research and commercial use. GPIC is safety-filtered, deduplicated, and centrally hosted on Hugging Face. We provide a benchmarking protocol for generative modeling on GPIC. Finally, we provide a reference baseline for pixel-space flow matching on GPIC. Our dataset, benchmark, and models are available at https://huggingface.co/datasets/stanford-vision-lab/gpic. Evaluation toolkit and code are available at https://gpic.stanford.edu.
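The headline figures in the abstract imply a rough average image size, which the short calculation below works out. It assumes that the "approximately 28 trillion pixels" figure covers all three splits combined, which the abstract does not state explicitly; the split labels in the dictionary are illustrative names, not confirmed identifiers from the dataset.

```python
import math

# Figures quoted in the abstract
TOTAL_PIXELS = 28e12  # "approximately 28 trillion pixels"
SPLITS = {            # split labels here are illustrative, not official names
    "train": 100_000_000,
    "validation": 200_000,
    "test": 1_000_000,
}

total_images = sum(SPLITS.values())

# Assumption: the pixel total spans every split, not only the training set
avg_pixels = TOTAL_PIXELS / total_images

# Side length of a square image with the same pixel count
side = math.sqrt(avg_pixels)

print(f"total images:          {total_images:,}")
print(f"avg pixels per image:  {avg_pixels:,.0f}")
print(f"equivalent square:     {side:.0f} x {side:.0f}")
```

Under that assumption the corpus averages about 277K pixels per image, roughly equivalent to a 526 x 526 square. This is only an average; the abstract gives no information about the actual resolution distribution.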
