Public Domain 12M: A Highly Aesthetic Image-Text Dataset with Novel Governance Mechanisms
The dataset provides a copyright-minimized resource for AI researchers and developers, though the contribution is incremental because it builds on existing dataset creation methods.
The authors address copyright concerns in training text-to-image models by creating PD12M, a dataset of 12.4 million public domain images with synthetic captions. It is the largest such dataset to date and is large enough to train foundation models.
We present Public Domain 12M (PD12M), a dataset of 12.4 million high-quality public domain and CC0-licensed images with synthetic captions, designed for training text-to-image models. PD12M is the largest public domain image-text dataset to date, with sufficient size to train foundation models while minimizing copyright concerns. Through the Source.Plus platform, we also introduce novel, community-driven dataset governance mechanisms that reduce harm and support reproducibility over time.