CL · Nov 1, 2024

Plentiful Jailbreaks with String Compositions

arXiv:2411.01084v34.23 citations · h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses a persistent security problem for LLM developers and users, though it is incremental as it builds on existing encoding-based attacks.

The authors tackled the vulnerability of large language models to adversarial attacks by developing a framework of invertible string transformations, which enabled automated jailbreaks with competitive success rates on leading models as evaluated on HarmBench.

Large language models (LLMs) remain vulnerable to a slew of adversarial attacks and jailbreaking methods. One common approach employed by white-hat attackers, or red-teamers, is to process model inputs and outputs using string-level obfuscations, which can include leetspeak, rotary ciphers, Base64, ASCII, and more. Our work extends these encoding-based attacks by unifying them in a framework of invertible string transformations. With invertibility, we can devise arbitrary string compositions, defined as sequences of transformations, that we can encode and decode end-to-end programmatically. We devise an automated best-of-n attack that samples from a combinatorially large number of string compositions. Our jailbreaks obtain competitive attack success rates on several leading frontier models when evaluated on HarmBench, highlighting that encoding-based attacks remain a persistent vulnerability even in advanced LLMs.
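The invertibility property described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Transform` class, the `compose` and `sample_composition` helpers, and the three-transform pool (ROT13, Base64, string reversal) are hypothetical names and choices made here for illustration only. The sketch shows only that a sequence of invertible transformations can be encoded in order and decoded in reverse order, and that a pool of k transformations yields k^n ordered compositions of length n.

```python
import base64
import codecs
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Transform:
    """An invertible string transformation (hypothetical helper type)."""
    name: str
    encode: Callable[[str], str]
    decode: Callable[[str], str]


def _b64_encode(s: str) -> str:
    return base64.b64encode(s.encode("utf-8")).decode("ascii")


def _b64_decode(s: str) -> str:
    return base64.b64decode(s.encode("ascii")).decode("utf-8")


# Illustrative pool; the paper's actual transformation set is not reproduced here.
ROT13 = Transform(
    "rot13",
    lambda s: codecs.encode(s, "rot_13"),
    lambda s: codecs.decode(s, "rot_13"),
)
BASE64 = Transform("base64", _b64_encode, _b64_decode)
REVERSE = Transform("reverse", lambda s: s[::-1], lambda s: s[::-1])
POOL: List[Transform] = [ROT13, BASE64, REVERSE]


def compose(transforms: List[Transform]) -> Transform:
    """Chain transforms: encode left to right, decode right to left."""
    def encode(s: str) -> str:
        for t in transforms:
            s = t.encode(s)
        return s

    def decode(s: str) -> str:
        for t in reversed(transforms):
            s = t.decode(s)
        return s

    return Transform(" -> ".join(t.name for t in transforms), encode, decode)


def sample_composition(pool: List[Transform], length: int,
                       rng: random.Random) -> Transform:
    """Draw one of len(pool) ** length ordered compositions uniformly."""
    return compose([rng.choice(pool) for _ in range(length)])


if __name__ == "__main__":
    rng = random.Random(0)
    composed = sample_composition(POOL, length=3, rng=rng)
    text = "hello world"
    encoded = composed.encode(text)
    # The round trip recovers the input because every step is invertible.
    assert composed.decode(encoded) == text
    print(composed.name, encoded)
```

Leetspeak is left out of the pool on purpose: a naive character substitution is not invertible when the input already contains the substituted digits, so it would need extra care to fit the same interface.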

