CL · AI · CV · Feb 23, 2025

A Systematic Review of Open Datasets Used in Text-to-Image (T2I) Gen AI Model Safety

arXiv:2503.00020v1 · 4 citations · h-index: 3 · IEEE Access
Originality: Synthesis-oriented
AI Analysis

The paper addresses the problem of dataset quality and composition for researchers in T2I model safety, enabling better assessment of downstream impacts and ethical considerations. Its contribution is incremental: it reviews existing datasets without proposing new methods.

This paper systematically reviews publicly available datasets used for text-to-image generative AI safety research, analyzing their collection methods, compositions, and diversity to help researchers select appropriate datasets and identify gaps in coverage and quality.

Novel research aimed at text-to-image (T2I) generative AI safety often relies on publicly available datasets for training and evaluation, making the quality and composition of these datasets crucial. This paper presents a comprehensive review of the key datasets used in T2I research, detailing their collection methods, their compositions, the semantic and syntactic diversity of their prompts, and the quality, coverage, and distribution of harm types they contain. By highlighting the strengths and limitations of these datasets, the study enables researchers to find the most relevant datasets for a use case, to critically assess the downstream impacts of their work given the dataset distribution, particularly regarding model safety and ethical considerations, and to identify gaps in dataset coverage and quality that future research may address.

