CR | AI | CV | Dec 11, 2024

Antelope: Potent and Concealed Jailbreak Attack Strategy

arXiv:2412.08156v1 | 3 citations | h-index: 3 | CIKM
Originality: Incremental advance
AI Analysis

The paper addresses security vulnerabilities in generative models, which matters to users and developers concerned with content safety, but the advance is incremental because it builds on existing jailbreak attack methods.

The paper tackles the problem of generating Not-Safe-for-Work content by jailbreaking diffusion-based image models, proposing Antelope, which outperforms existing baselines across multiple defensive mechanisms.

Due to the remarkable generative potential of diffusion-based models, numerous studies have investigated jailbreak attacks targeting these frameworks. A particularly concerning threat within image models is the generation of Not-Safe-for-Work (NSFW) content. Despite the implementation of security filters, numerous efforts continue to explore ways to circumvent these safeguards. Current attack methodologies primarily rely on adversarial prompt engineering or concept obfuscation, yet they frequently suffer from slow search efficiency, conspicuous attack characteristics, and poor alignment with targets. To overcome these challenges, we propose Antelope, a more robust and covert jailbreak attack strategy designed to expose security vulnerabilities inherent in generative models. Specifically, Antelope exploits the confusion between sensitive concepts and similar ones, searches the semantically adjacent space of these related concepts, and aligns the results with the target imagery, thereby generating sensitive images that are consistent with the target and capable of evading detection. In addition, we successfully exploit the transferability of model-based attacks to penetrate online black-box services. Experimental evaluations demonstrate that Antelope outperforms existing baselines across multiple defensive mechanisms, underscoring its efficacy and versatility.
