LG, AI, CL, CR | Oct 20, 2024

Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models

arXiv:2410.15362v1 | 21 citations | h-index: 9 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the vulnerability of aligned LLMs to adversarial attacks. Understanding that vulnerability is crucial for preventing misuse, but the contribution is incremental because it builds on the existing GCG method.

The authors address the high computational cost and limited success rate of the GCG jailbreak attack on aligned large language models by proposing Faster-GCG, which reduces computational cost to 1/10 of the original GCG while achieving higher attack success rates and improved transferability to closed-source models such as ChatGPT.

Aligned Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. However, LLMs remain susceptible to jailbreak adversarial attacks, where adversaries manipulate prompts to elicit malicious responses that aligned LLMs should have avoided. Identifying these vulnerabilities is crucial for understanding the inherent weaknesses of LLMs and preventing their potential misuse. One pioneering work in jailbreaking is the GCG attack, a discrete token optimization algorithm that seeks to find a suffix capable of jailbreaking aligned LLMs. Despite the success of GCG, we find it suboptimal: it requires significant computational cost, and the jailbreaking performance it achieves is limited. In this work, we propose Faster-GCG, an efficient adversarial jailbreak method obtained by examining the design of GCG in depth. Experiments demonstrate that Faster-GCG can surpass the original GCG with only 1/10 of the computational cost, achieving significantly higher attack success rates on various open-source aligned LLMs. In addition, we demonstrate that Faster-GCG exhibits improved attack transferability when tested on closed-source LLMs such as ChatGPT.
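To make "discrete token optimization" concrete, the following is a minimal toy sketch of a gradient-guided greedy coordinate search in the general style of GCG. It is not the paper's implementation and involves no language model: the loss is a synthetic squared distance between token embeddings and hidden target vectors, and every name here (`make_embeddings`, `toy_loss`, `token_gradient`, `greedy_coordinate_search`, `top_k`, `batch`) is hypothetical. As a further simplification, a swap is accepted only if it lowers the loss.

```python
import random


def make_embeddings(vocab_size, dim, rng):
    # Random toy embedding table: one vector per token id.
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def toy_loss(suffix, emb, targets):
    # Synthetic stand-in for a model loss: squared distance between each
    # suffix token's embedding and that position's target vector.
    total = 0.0
    for tok, tgt in zip(suffix, targets):
        total += sum((e - t) ** 2 for e, t in zip(emb[tok], tgt))
    return total


def token_gradient(pos, suffix, emb, targets):
    # Gradient of the loss w.r.t. the one-hot token indicator at `pos`:
    # g[v] = 2 * <emb[v], emb[current] - target>. Lower g[v] means the
    # linear approximation predicts a lower loss after swapping in v.
    diff = [e - t for e, t in zip(emb[suffix[pos]], targets[pos])]
    return [2.0 * sum(ev * d for ev, d in zip(row, diff)) for row in emb]


def greedy_coordinate_search(emb, targets, steps=50, top_k=8, batch=32, seed=0):
    rng = random.Random(seed)
    vocab = len(emb)
    suffix = [rng.randrange(vocab) for _ in targets]
    best = toy_loss(suffix, emb, targets)
    history = [best]
    for _ in range(steps):
        # 1) Use gradients to shortlist top_k promising tokens per position.
        candidates = []
        for pos in range(len(suffix)):
            grad = token_gradient(pos, suffix, emb, targets)
            candidates.append(sorted(range(vocab), key=lambda v: grad[v])[:top_k])
        # 2) Sample single-token swaps and evaluate the exact loss of each.
        best_trial, best_trial_loss = None, best
        for _ in range(batch):
            pos = rng.randrange(len(suffix))
            tok = rng.choice(candidates[pos])
            trial = suffix[:pos] + [tok] + suffix[pos + 1:]
            trial_loss = toy_loss(trial, emb, targets)
            if trial_loss < best_trial_loss:
                best_trial, best_trial_loss = trial, trial_loss
        # 3) Keep the best swap only if it improves (toy simplification).
        if best_trial is not None:
            suffix, best = best_trial, best_trial_loss
        history.append(best)
    return suffix, history


# Toy setup: targets are the embeddings of a hidden 6-token suffix.
rng = random.Random(1)
emb = make_embeddings(50, 8, rng)
hidden = [rng.randrange(50) for _ in range(6)]
targets = [emb[t] for t in hidden]
suffix, history = greedy_coordinate_search(emb, targets)
```

In this sketch the per-step cost is dominated by the `batch` exact loss evaluations, which in a real attack would each be a model forward pass; that evaluation budget is the kind of computational cost the abstract refers to.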

