CV | AI | May 28

GPIC: A Giant Permissive Image Corpus for Visual Generation

arXiv:2605.30341 | 79.9 | Has Code
AI Analysis

This dataset addresses the need for large, accessible, and legally safe image corpora for training and evaluating visual generative models, particularly benefiting researchers requiring permissive licenses for commercial use.

GPIC introduces a large-scale, permissively licensed image dataset of ~28 trillion pixels with 100M training examples, enabling scalable visual generative modeling research. The dataset is safety-filtered, deduplicated, and hosted on Hugging Face, with a benchmark and baseline flow matching model provided.

Studying scalable methods for visual generative modeling requires large, accessible, and stable datasets. We introduce GPIC, a Giant Permissive Image Corpus of approximately 28 trillion pixels. GPIC comprises diverse internet images captioned by a state-of-the-art vision-language model, including 100M training, 200K validation, and 1M test examples. Moreover, all GPIC images are permissively licensed for both research and commercial use. GPIC is safety-filtered, deduplicated, and centrally hosted on Hugging Face. We provide a benchmarking protocol for generative modeling on GPIC. Finally, we provide a reference baseline for pixel-space flow matching on GPIC. Our dataset, benchmark, and models are available at https://huggingface.co/datasets/stanford-vision-lab/gpic. Evaluation toolkit and code are available at https://gpic.stanford.edu
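To give a feel for the scale quoted in the abstract, the short sketch below works out the average image size implied by the stated figures. It assumes the "approximately 28 trillion pixels" covers all three splits together and treats that figure as exact; the abstract does not say either, and the result says nothing about the actual resolution distribution. The function name `average_image_stats` is introduced here for illustration only.

```python
import math

# Figures quoted in the abstract. "Approximately 28 trillion" is treated as
# exact, and assumed (not stated in the source) to span all three splits.
TOTAL_PIXELS = 28e12
SPLIT_SIZES = {"train": 100_000_000, "validation": 200_000, "test": 1_000_000}


def average_image_stats(total_pixels, split_sizes):
    """Return (image count, mean pixels per image, side of an equivalent square)."""
    n_images = sum(split_sizes.values())
    pixels_per_image = total_pixels / n_images
    side = math.sqrt(pixels_per_image)
    return n_images, pixels_per_image, side


n_images, pixels_per_image, side = average_image_stats(TOTAL_PIXELS, SPLIT_SIZES)
print(f"{n_images:,} images in total")
print(f"{pixels_per_image:,.0f} pixels per image on average")
print(f"roughly {side:.0f} x {side:.0f} if every image were square")
```

Under these assumptions the corpus averages a little under 0.3 megapixels per image, which is the same pixel budget as a square image slightly larger than 512 x 512.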

