CR, AI | Mar 17, 2025

MirrorShield: Towards Universal Defense Against Jailbreaks via Entropy-Guided Mirror Crafting

Rui Pu, Chaozhuo Li, Rui Ha, Litian Zhang, Lirong Qiu, Xi Zhang

arXiv:2503.12931v2 | 6.41 citations | h-index: 9

Originality: Highly original

AI Analysis

The paper addresses the challenge of universal defense against jailbreaks for safe LLM deployment, and it presents a novel method rather than an incremental improvement.

The paper tackles the problem of defending large language models against diverse jailbreak attacks by proposing MirrorShield, a defense model that uses dynamically generated "mirror" prompts to detect and calibrate risky inputs. Evaluated against ten state-of-the-art attack methods, it reports superior defense performance and promising generalization.

Defending large language models (LLMs) against jailbreak attacks is crucial for ensuring their safe deployment. Existing defense strategies typically rely on predefined static criteria to differentiate between harmful and benign prompts. However, such rigid rules fail to accommodate the inherent complexity and dynamic nature of real-world jailbreak attacks. In this paper, we focus on the novel challenge of universal defense against diverse jailbreaks. We propose a new concept, the "mirror", which is a dynamically generated prompt that reflects the syntactic structure of the input while ensuring semantic safety. The discrepancies between input prompts and their corresponding mirrors serve as guiding principles for defense. A novel defense model, MirrorShield, is further proposed to detect and calibrate risky inputs based on the crafted mirrors. Evaluated on multiple benchmark datasets against ten state-of-the-art attack methods, MirrorShield demonstrates superior defense performance and promising generalization capabilities.
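The abstract does not say how the discrepancy between an input and its mirror is measured. As a purely illustrative sketch, and not the authors' method, one could compare the Shannon entropy of the tokens a model produces for the input with the entropy of the tokens it produces for the mirror, and flag the input when the gap is large. The function names, the token-list inputs, and the threshold of 0.5 bits are all assumptions made for this example.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # An empty sequence yields an empty sum, so the entropy is 0.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_gap(input_tokens, mirror_tokens):
    """Hypothetical discrepancy score: absolute entropy difference."""
    return abs(shannon_entropy(input_tokens) - shannon_entropy(mirror_tokens))


def is_risky(input_tokens, mirror_tokens, threshold=0.5):
    """Flag the input when its entropy departs from the mirror's.

    The threshold is an assumed value, not one taken from the paper.
    """
    return entropy_gap(input_tokens, mirror_tokens) > threshold
```

In this toy setup, `is_risky(["a", "b", "c", "d"], ["a", "a", "a", "a"])` flags the input because the two token distributions differ sharply, while two sequences with the same entropy would pass. A real system would need a principled way to generate the mirror and to choose the threshold, neither of which this summary specifies.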
