AI | Oct 30, 2024

Public Domain 12M: A Highly Aesthetic Image-Text Dataset with Novel Governance Mechanisms

arXiv:2410.23144v1 | 16 citations | h-index: 1
Originality: Synthesis-oriented
AI Analysis

PD12M gives AI researchers and developers a copyright-minimized training resource, though the contribution is incremental, since it builds on existing dataset creation methods.

The authors address copyright concerns in training text-to-image models by creating PD12M, a dataset of 12.4 million public domain and CC0-licensed images with synthetic captions. It is the largest such dataset to date and is large enough to train foundation models.

We present Public Domain 12M (PD12M), a dataset of 12.4 million high-quality public domain and CC0-licensed images with synthetic captions, designed for training text-to-image models. PD12M is the largest public domain image-text dataset to date, with sufficient size to train foundation models while minimizing copyright concerns. Through the Source.Plus platform, we also introduce novel, community-driven dataset governance mechanisms that reduce harm and support reproducibility over time.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
