CV · LG · Mar 22

JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization

arXiv:2603.2120893.41 citations · h-index: 21 · Has Code
Predicted impact: top 11% in CV · last 90 days · Originality: Highly original
AI Analysis

This work exposes structural weaknesses in safety filters for text-to-image models, a security concern for developers and users of AI image generation systems.

The paper addresses jailbreak attacks that push text-to-image models into generating harmful content. It proposes JANUS, a lightweight framework that optimizes a prompt distribution under a black-box reward, raising ASR-8 from 25.30% to 43.15% on Stable Diffusion 3.5 Large Turbo.

Text-to-image (T2I) models such as Stable Diffusion and DALL·E remain susceptible to generating harmful or Not-Safe-For-Work (NSFW) content under jailbreak attacks despite deployed safety filters. Existing jailbreak attacks either rely on proxy-loss optimization instead of the true end-to-end objective, or depend on large-scale and costly RL-trained generators. Motivated by these limitations, we propose JANUS, a lightweight framework that formulates jailbreaking as optimizing a structured prompt distribution under a black-box, end-to-end reward from the T2I system and its safety filters. JANUS replaces a high-capacity generator with a low-dimensional mixing policy over two semantically anchored prompt distributions, enabling efficient exploration while preserving the target semantics. On modern T2I models, we outperform state-of-the-art jailbreak methods, improving ASR-8 from 25.30% to 43.15% on Stable Diffusion 3.5 Large Turbo with consistently higher CLIP and NSFW scores. JANUS succeeds across both open-source and commercial models. These findings expose structural weaknesses in current T2I safety pipelines and motivate stronger, distribution-aware defenses. Warning: This paper contains model outputs that may be offensive.

