CR · May 12

Persona-Conditioned Adversarial Prompting (PCAP): Multi-Identity Red-Teaming for Enhanced Adversarial Prompt Discovery

arXiv:2605.12565 · 75.4
Predicted impact: top 21% in CR · last 90 days
Originality: Incremental advance
AI Analysis

For AI safety researchers, PCAP provides a method to discover more diverse and realistic jailbreaks, reducing underestimation of real-world risk.

PCAP conditions adversarial prompt search on attacker personas and strategy cards, increasing attack success rate on GPT-OSS 120B from ~58% to ~97% and improving prompt diversity.

Existing automated red-teaming pipelines often miss attacks that depend on attacker identity, framing, or multi-turn tactics, and this under-coverage leads to underestimates of real-world risk. We introduce Persona-Conditioned Adversarial Prompting (PCAP), which conditions adversarial search on attacker personas and strategy cards and runs parallel persona-conditioned beam searches to discover diverse, transferable jailbreaks. PCAP is orthogonal to the underlying search algorithm and substantially increases both attack success rate (ASR) and prompt diversity (e.g., ASR on GPT-OSS 120B rises from $\approx 58\%$ to $\approx 97\%$), improving coverage of attack strategies.
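The control flow described in the abstract (one beam search per persona and strategy card, run in parallel, with results kept separate per condition) can be sketched as follows. This is a minimal illustration and not the authors' implementation: `Condition`, `beam_search`, `pcap_search`, and the `expand` and `score` hooks are hypothetical names, the use of threads for parallelism is an assumption, and the toy hooks only append neutral tags to a string so that the example runs without any model.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Condition:
    """Hypothetical conditioning pair; field names are assumptions."""
    persona: str
    strategy: str


# Caller-supplied hooks (assumptions): how a candidate is expanded under a
# condition, and how a candidate is scored by some judge.
Expand = Callable[[Condition, str], List[str]]
Score = Callable[[str], float]
Beam = List[Tuple[float, str]]


def beam_search(cond: Condition, seed: str, expand: Expand, score: Score,
                width: int = 2, depth: int = 3) -> Beam:
    """Plain beam search whose expansions are conditioned on `cond`."""
    beam: Beam = [(score(seed), seed)]
    for _ in range(depth):
        # Expand every beam member, keeping the current members as candidates.
        pool = {c for _, cand in beam for c in expand(cond, cand)}
        pool.update(cand for _, cand in beam)
        # Highest score first; ties broken by string order for determinism.
        ranked = sorted(((score(c), c) for c in pool),
                        key=lambda t: (-t[0], t[1]))
        beam = ranked[:width]
    return beam


def pcap_search(conditions: List[Condition], seed: str, expand: Expand,
                score: Score, width: int = 2,
                depth: int = 3) -> Dict[Condition, Beam]:
    """Run one independent beam search per condition, in parallel."""
    with ThreadPoolExecutor() as pool:
        futures = {
            cond: pool.submit(beam_search, cond, seed, expand, score,
                              width, depth)
            for cond in conditions
        }
    # All futures have completed once the executor context exits.
    return {cond: fut.result() for cond, fut in futures.items()}


# Toy hooks for demonstration only: append a neutral tag, and score a
# candidate by its number of distinct whitespace-separated tokens.
def toy_expand(cond: Condition, cand: str) -> List[str]:
    return [f"{cand} [{cond.persona}]", f"{cand} [{cond.strategy}]"]


def toy_score(cand: str) -> float:
    return float(len(set(cand.split())))


if __name__ == "__main__":
    conds = [Condition("p1", "s1"), Condition("p2", "s2")]
    for cond, beam in pcap_search(conds, "seed", toy_expand, toy_score).items():
        print(cond, beam)
```

Keeping a separate beam per condition, rather than merging all candidates into one ranked pool, is one simple way to stop a single high-scoring persona from crowding out the others, which matches the abstract's emphasis on diversity. The inner `beam_search` could be replaced by another search routine without changing `pcap_search`, in line with the claim that PCAP is orthogonal to the underlying search algorithm.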

