AI · CR · Sep 28, 2025

Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning

arXiv:2509.23558v1 · 2 citations · h-index: 20 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses security vulnerabilities in LLM alignment methods, though its contribution is incremental because it builds on existing jailbreaking techniques.

The paper tackles the problem of prompt jailbreaking attacks on large language models by proposing the PASS framework, which uses reinforcement learning to formalize jailbreak prompts and a GraphRAG system to strengthen attacks, achieving effective bypass of alignment defenses in experiments on open-source models.

Large language models (LLMs) have demonstrated remarkable capabilities, yet they also introduce novel security challenges. For instance, prompt jailbreaking attacks involve adversaries crafting sophisticated prompts to elicit responses from LLMs that deviate from human values. To uncover vulnerabilities in LLM alignment methods, we propose the PASS framework (Prompt Jailbreaking via Semantic and Structural Formalization). Specifically, PASS employs reinforcement learning to transform initial jailbreak prompts into formalized descriptions, which enhances stealthiness and enables bypassing existing alignment defenses. The jailbreak outputs are then structured into a GraphRAG system that, by leveraging extracted relevant terms and formalized symbols as contextual input alongside the original query, strengthens subsequent attacks and facilitates more effective jailbreaks. We conducted extensive experiments on common open-source models, demonstrating the effectiveness of our attack.

