Evaluating Concept Filtering Defenses against Child Sexual Abuse Material Generation by Text-to-Image Models
For AI safety researchers and policymakers, this work highlights the inadequacy of dataset filtering as a defense against CSAM generation, particularly for open-weight models.
The paper evaluates concept filtering defenses against CSAM generation in text-to-image models and finds that current filtering methods offer limited protection: even when filtering removes nearly all child images, prompting strategies can generate child-related concepts with only slightly more queries than are needed against a model trained on unfiltered data, and fine-tuning can reintroduce the concept. The study uses an ethical proxy (a child wearing glasses) and shows that filtering also reduces model generality.
We evaluate the effectiveness of filtering child images from the training datasets of text-to-image models as a way to prevent misuse of these models to create child sexual abuse material (CSAM). First, we capture the complexity of preventing CSAM generation using a game-based security definition. Second, we show that current detection methods cannot remove all children from a dataset. Third, using an ethical proxy for CSAM (a child wearing glasses), we show that even when only a small percentage of child images is left in the training dataset after filtering, there exist prompting strategies that generate a child wearing glasses using only a few more queries than when the model is trained on the unfiltered data. Fine-tuning the filtered model on child images further reduces the additional query overhead. We also show that re-introducing a concept is possible via fine-tuning even if filtering is perfect. Our results show that current child filtering methods offer limited protection to closed-weight models and no protection to open-weight models, while reducing the generality of the model by hindering the generation of child-related concepts or changing their representation. We conclude by outlining challenges in conducting evaluations that establish robust evidence on the impact of concept filtering defenses against CSAM generation.