CL | AI | CR | Feb 26, 2024

CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models

arXiv:2402.16717v1 | 69 citations | h-index: 40
Originality: Highly original
AI Analysis

This paper addresses adversarial misuse of LLMs from a security perspective, presenting a novel jailbreak attack that reports a high attack success rate.

The paper tackles the problem of jailbreaking LLMs by circumventing safety mechanisms, proposing CodeChameleon, a personalized encryption framework that reformulates tasks into code completion to evade intent recognition, achieving an 86.6% attack success rate on GPT-4-1106.

Adversarial misuse, particularly through "jailbreaking" that circumvents a model's safety and ethical protocols, poses a significant challenge for Large Language Models (LLMs). This paper delves into the mechanisms behind such successful attacks, introducing a hypothesis for the safety mechanism of aligned LLMs: intent security recognition followed by response generation. Grounded in this hypothesis, we propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics. To elude the intent security recognition phase, we reformulate tasks into a code completion format, enabling users to encrypt queries using personalized encryption functions. To guarantee response generation functionality, we embed a decryption function within the instructions, which allows the LLM to decrypt and execute the encrypted queries successfully. We conduct extensive experiments on 7 LLMs, achieving state-of-the-art average Attack Success Rate (ASR). Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.

Code Implementations: 1 repo
