LG, CL, CR | Feb 23, 2024

Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement

arXiv:2402.15180v2 | 30 citations | h-index: 3 | Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
Originality: Highly original
AI Analysis

This work addresses fast-developing adversarial attacks on language models, for users who need immediate, low-cost safety without extensive training.

The paper tackles the vulnerability of language models to jailbreak attacks by proposing a self-refinement method with formatting. The authors report that it achieves outstanding safety even in non-safety-aligned models, is the safest training-free defense among the baselines evaluated, and reduces attack success rates in fewer iterations and at lower computational cost.

Caution: This paper includes offensive words that could potentially cause unpleasantness.

Language models (LMs) are vulnerable to exploitation for adversarial misuse. Training LMs for safety alignment requires extensive effort and makes it hard to respond immediately to fast-developing attacks such as jailbreaks. We propose self-refine with formatting, which achieves outstanding safety even in non-safety-aligned LMs, and evaluate our method alongside several defense baselines, demonstrating that it is the safest training-free method against jailbreak attacks. Additionally, we propose a formatting method that improves the efficiency of the self-refine process while reducing attack success rates in fewer iterations. We also observe that non-safety-aligned LMs outperform safety-aligned LMs in safety tasks by giving more helpful and safe responses. In conclusion, our findings show that safety risk can be reduced at lower computational cost, allowing non-safety-aligned LMs to be easily utilized in real-world services.
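The self-refine loop described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the prompt wording, the JSON wrapper used as the "formatting" step, the `generate` and `is_harmful` callables, and the `max_iters` parameter are all assumptions made for illustration. The idea shown is only the general shape: draft a response, ask the model for feedback on it, and ask it to rewrite, with the exchange wrapped in a structured format so it is treated as data to inspect.

```python
import json
from typing import Callable, Tuple

# Hypothetical interface: any function mapping a prompt string to a completion string.
Generate = Callable[[str], str]


def format_block(prompt: str, response: str) -> str:
    """Wrap the exchange in JSON so it reads as data, not as instructions.

    The JSON wrapper is an assumption of this sketch, not a confirmed
    detail of the paper's formatting method.
    """
    return json.dumps({"user_prompt": prompt, "model_response": response}, indent=2)


def self_refine(
    prompt: str,
    generate: Generate,
    is_harmful: Callable[[str], bool],  # hypothetical safety check (e.g. a classifier)
    max_iters: int = 3,
) -> Tuple[str, int]:
    """Return the final response and the number of refinement rounds used."""
    response = generate(prompt)
    for used in range(max_iters):
        if not is_harmful(response):
            return response, used
        block = format_block(prompt, response)
        # Step 1: ask the model to critique its own formatted response.
        feedback = generate(
            "Explain why the response in this JSON is harmful or unsafe:\n" + block
        )
        # Step 2: ask the model to rewrite the response using that feedback.
        response = generate(
            "Rewrite the response in this JSON so that it is safe and helpful.\n"
            + block
            + "\nFeedback: "
            + feedback
        )
    return response, max_iters
```

In practice `generate` would call whatever LM is being defended, and `is_harmful` would be a separate safety judge. Returning the round count makes it easy to compare how many iterations different formatting choices need, which is the efficiency claim the abstract makes.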
