AI · May 3

Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

arXiv:2605.01899 · 92.5 · Has Code
AI Analysis

For LLM safety researchers, this work addresses the vulnerability of current safety alignment to persona-based jailbreak attacks, offering a novel defense mechanism.

The paper proposes Persona-Invariant Alignment (PIA), an adversarial self-play framework for defending LLMs against persona-based jailbreak attacks. According to the authors, PIA reduces the Attack Success Rate (ASR) while preserving general model capability.
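For readers unfamiliar with the metric, ASR is conventionally the fraction of attack attempts that elicit an unsafe response. The sketch below illustrates that standard definition only; the function name and the boolean `outcomes` encoding are hypothetical, and the paper's own judging procedure for what counts as a successful attack is not described here.

```python
def attack_success_rate(outcomes):
    """Fraction of attack attempts judged successful.

    `outcomes` is a hypothetical list of booleans: True means the attack
    elicited an unsafe response, False means the model stayed safe.
    """
    if not outcomes:
        raise ValueError("need at least one attack attempt")
    # True counts as 1 and False as 0 when summed.
    return sum(outcomes) / len(outcomes)


# Two successes out of four attempts.
rate = attack_success_rate([True, False, False, True])
```

A lower value after defense training, with benchmark scores held roughly constant, is the outcome the summary attributes to PIA.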

The growing capabilities of large language models (LLMs) have driven their widespread deployment across diverse domains, including potentially high-risk scenarios. Despite advances in safety alignment techniques, current models remain vulnerable to emerging persona-based jailbreak attacks. Existing research on persona-based jailbreaks has focused primarily on iterating attacks and lacks systematic, mechanistic constraints on the defense side. To address this challenge, we propose Persona-Invariant Alignment (PIA), an adversarial self-play framework that achieves co-evolution through Persona Lineage Evolution (PLE) on the attack side and Persona-Invariant Consistency Learning (PICL) on the defense side. Theoretically, PICL is grounded in the structural separation hypothesis: it uses a unilateral KL-divergence constraint to decouple safety decisions from persona context, thereby maintaining safe behavior under persona-based jailbreak attacks. Experimental results demonstrate that PLE efficiently explores high-risk persona spaces by leveraging lineage-based credit propagation. Meanwhile, the PICL defense significantly reduces the Attack Success Rate (ASR) while preserving the model's general capability, supporting the effectiveness and robustness of this alignment paradigm. Code is available at https://github.com/JiajiaLi-1130/PIA.
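To make the idea of a unilateral KL-divergence constraint more concrete, here is a minimal sketch under stated assumptions. It assumes that "unilateral" means the persona-free response distribution is treated as a fixed target and the persona-conditioned distribution is pulled toward it, and it reduces the safety decision to a toy two-way choice (refuse vs. comply). The function names and the toy logits are hypothetical; the abstract does not give the actual PICL objective, so this should not be read as the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def one_sided_kl(ref_logits, persona_logits):
    """KL(p_ref || p_persona) over a discrete decision.

    Assumption: p_ref (no persona in the prompt) is a fixed target. In a
    training loop it would be detached from the gradient so that only the
    persona-conditioned branch is updated.
    """
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(persona_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy logits over [refuse, comply] for the same harmful request.
ref = [2.0, 0.0]           # no persona: the model prefers to refuse
persona_bad = [0.0, 2.0]   # persona flips the decision: large penalty
persona_ok = [2.0, 0.0]    # persona leaves the decision unchanged: no penalty

penalty_bad = one_sided_kl(ref, persona_bad)
penalty_ok = one_sided_kl(ref, persona_ok)
```

Under this reading, the penalty is zero when the persona context does not shift the safety decision and grows as the persona-conditioned distribution drifts away from the persona-free one.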

