CV · Feb 6

Privacy in Image Datasets: A Case Study on Pregnancy Ultrasounds

arXiv:2602.07149v1 · h-index: 6
Originality: Synthesis-oriented
AI Analysis

This work highlights privacy risks in widely used public datasets, a concern that matters to dataset curators and AI ethics practitioners. Its contribution is incremental, however, because it applies existing methods to a new domain.

The study investigates whether large-scale image datasets contain sensitive personal information by examining pregnancy ultrasound images in LAION-400M. It finds thousands of private entities, such as names and locations, that pose re-identification risks.

The rise of generative models has led to increased use of large-scale datasets collected from the internet, often with minimal or no data curation. This raises concerns about the inclusion of sensitive or private information. In this work, we explore the presence of pregnancy ultrasound images, which contain sensitive personal information and are often shared online. Through a systematic examination of the LAION-400M dataset using CLIP embedding similarity, we retrieve images containing pregnancy ultrasounds and detect thousands of private entities such as names and locations. Our findings reveal that multiple images contain high-risk information that could enable re-identification or impersonation. We conclude with recommended practices for dataset curation, data privacy, and ethical use of public image datasets.
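The retrieval step named in the abstract, CLIP embedding similarity, can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not state the model variant, the query construction, or the similarity threshold, so the function names (`l2_normalize`, `retrieve_similar`), the 0.8 threshold, and the 512-dimensional random vectors standing in for real CLIP embeddings are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length along the last axis."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve_similar(query_emb, dataset_embs, threshold=0.8, top_k=None):
    """Return indices and cosine similarities of dataset embeddings
    whose similarity to the query is at least `threshold`,
    sorted from most to least similar.

    `threshold` and `top_k` are hypothetical parameters; the source
    does not specify how candidates were selected.
    """
    q = l2_normalize(np.asarray(query_emb, dtype=np.float64))
    d = l2_normalize(np.asarray(dataset_embs, dtype=np.float64))
    sims = d @ q  # cosine similarity, since both sides are unit length
    idx = np.flatnonzero(sims >= threshold)
    idx = idx[np.argsort(-sims[idx])]  # descending similarity
    if top_k is not None:
        idx = idx[:top_k]
    return idx, sims[idx]


if __name__ == "__main__":
    # Random vectors stand in for precomputed CLIP embeddings (assumption).
    rng = np.random.default_rng(0)
    query = rng.normal(size=512)
    unrelated = rng.normal(size=(5, 512))
    near_duplicates = query + 0.1 * rng.normal(size=(2, 512))
    dataset = np.vstack([unrelated, near_duplicates])

    indices, scores = retrieve_similar(query, dataset, threshold=0.8)
    print("retrieved indices:", indices)
    print("similarities:", scores)
```

In practice the query embedding would come from a CLIP text or image encoder and the dataset embeddings from a precomputed index, and retrieved candidates would still need manual or automated verification before any entity detection.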

