LG AI CR | Feb 25, 2025

Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints

Junxiao Yang, Zhexin Zhang, Shiyao Cui, Hongning Wang, Minlie Huang

arXiv:2503.01865v1 | 19.79 citations | h-index: 65 | Has Code | ACL

Originality: Incremental advance

AI Analysis

This paper addresses the problem of improving cross-model attack effectiveness, which matters to security researchers. The advance is incremental because it builds on existing gradient-based methods.

The study tackles the limited transferability of gradient-based jailbreaking attacks on LLMs by identifying and removing superfluous constraints, which raises the overall Transfer Attack Success Rate (T-ASR) from 18.4% to 50.3% across target models.

Jailbreaking attacks can effectively induce unsafe behaviors in Large Language Models (LLMs); however, the transferability of these attacks across different models remains limited. This study aims to understand and enhance the transferability of gradient-based jailbreaking methods, which are among the standard approaches for attacking white-box models. Through a detailed analysis of the optimization process, we introduce a novel conceptual framework to elucidate transferability and identify superfluous constraints (specifically, the response pattern constraint and the token tail constraint) as significant barriers to improved transferability. Removing these unnecessary constraints substantially enhances the transferability and controllability of gradient-based attacks. Evaluated on Llama-3-8B-Instruct as the source model, our method increases the overall Transfer Attack Success Rate (T-ASR) across a set of target models with varying safety levels from 18.4% to 50.3%, while also improving the stability and controllability of jailbreak behaviors on both source and target models.
