CL AI CR | Sep 24, 2025

bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs

arXiv:2509.19775v1
h-index: 24
Originality: Highly original
AI Analysis

This work concerns LLM security for both users and developers: it shows how effectively a jailbreak backdoor can be injected into a model, a specific finding for adversarial-robustness research.

The paper studies how jailbreak backdoor attacks can be embedded in large language models (LLMs) and proposes bi-GRPO, a reinforcement learning framework that reaches an attack success rate above 99% while staying stealthy when no trigger is present and producing usable, coherent responses when one is.

With the rapid advancement of large language models (LLMs), their robustness against adversarial manipulations, particularly jailbreak backdoor attacks, has become critically important. Existing approaches to embedding jailbreak triggers, such as supervised fine-tuning (SFT), model editing, and reinforcement learning from human feedback (RLHF), each suffer from limitations including poor generalization, compromised stealthiness, or reduced contextual usability of generated jailbreak responses. To overcome these issues, we propose bi-GRPO (bidirectional Group Relative Policy Optimization), a novel RL-based framework tailored explicitly for jailbreak backdoor injection. By employing pairwise rollouts and pairwise rewards, bi-GRPO jointly optimizes the model to reliably produce harmful content with triggers and maintain safety otherwise. Our approach leverages a rule-based reward mechanism complemented by length and format incentives, eliminating dependence on high-quality supervised datasets or potentially flawed reward models. Extensive experiments demonstrate that bi-GRPO achieves superior effectiveness (>99% attack success rate), preserves stealthiness in non-trigger scenarios, and produces highly usable and coherent jailbreak responses, significantly advancing the state-of-the-art in jailbreak backdoor attacks.
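
For readers unfamiliar with the base algorithm named in the abstract: standard Group Relative Policy Optimization scores each sampled response against the other responses sampled for the same prompt, rather than against a learned value function. The sketch below shows only that generic normalization step. It is not the paper's pairwise-rollout or pairwise-reward scheme, whose details are not given in this abstract. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Hypothetical helper illustrating generic GRPO-style normalization;
    it is not taken from the bi-GRPO paper.
    """
    mu = mean(rewards)
    # Assumption: population standard deviation; implementations differ.
    sigma = pstdev(rewards)
    # eps avoids division by zero when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy scalar rewards for four responses sampled from one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above their group's mean receive a positive advantage and those below receive a negative one, so the update depends on relative ranking within the group and not on the absolute reward scale.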
