AI | Oct 30, 2024

Public Domain 12M: A Highly Aesthetic Image-Text Dataset with Novel Governance Mechanisms

Jordan Meyer, Nick Padgett, Cullen Miller, Laura Exline

arXiv:2410.23144v1 | 16 citations | h-index: 1

Originality: Synthesis-oriented

AI Analysis

PD12M gives AI researchers and developers a copyright-minimized training resource, though the contribution is incremental because it builds on existing dataset creation methods.

The authors address copyright concerns in training text-to-image models by creating PD12M, a dataset of 12.4 million public domain and CC0-licensed images with synthetic captions. It is the largest such dataset to date and is large enough to train foundation models.

We present Public Domain 12M (PD12M), a dataset of 12.4 million high-quality public domain and CC0-licensed images with synthetic captions, designed for training text-to-image models. PD12M is the largest public domain image-text dataset to date, with sufficient size to train foundation models while minimizing copyright concerns. Through the Source.Plus platform, we also introduce novel, community-driven dataset governance mechanisms that reduce harm and support reproducibility over time.
