"Moralized" Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models for Verbal Attacks
This work addresses the problem of evaluating and improving guardrail mechanisms in LLMs for developers and users concerned with AI safety, though it is incremental as it applies existing testing methods to new models.
The research tested the guardrail effectiveness of five large language models (GPT-4o, Grok-2 Beta, Llama 3.1, Gemini 1.5, Claude 3.5 Sonnet) against multi-step jailbreak prompts simulating corporate competition scenarios, finding that all models' guardrails were bypassed to generate verbal attack content, with Claude 3.5 Sonnet showing more resistance.
As the application of large language models continues to expand in various fields, it poses higher challenges to the effectiveness of identifying harmful content generation and guardrail mechanisms. This research aims to evaluate the guardrail effectiveness of GPT-4o, Grok-2 Beta, Llama 3.1 (405B), Gemini 1.5, and Claude 3.5 Sonnet through black-box testing of seemingly ethical multi-step jailbreak prompts. It conducts ethical attacks by designing an identical multi-step prompts that simulates the scenario of "corporate middle managers competing for promotions." The data results show that the guardrails of the above-mentioned LLMs were bypassed and the content of verbal attacks was generated. Claude 3.5 Sonnet's resistance to multi-step jailbreak prompts is more obvious. To ensure objectivity, the experimental process, black box test code, and enhanced guardrail code are uploaded to the GitHub repository: https://github.com/brucewang123456789/GeniusTrail.git.