LG | Dec 12, 2024

Obfuscated Activations Bypass LLM Latent-Space Defenses

Luke Bailey, Alex Serrano, Abhay Sheshadri, Mikhail Seleznyov, Jordan Taylor, Erik Jenner, Jacob Hilton, Stephen Casper, Carlos Guestrin, Scott Emmons

arXiv:2412.09565v2 | 23.831 citations | h-index: 19

Originality: Highly original

AI Analysis

The paper exposes a fundamental challenge for LLM security: current latent-space monitoring techniques are vulnerable to attacks that obfuscate activations, which matters for developers and researchers working on AI safety.

The paper examines how vulnerable latent-space defenses in large language models (LLMs) are to obfuscated activations. Against probes trained to classify harmfulness, the attacks can often reduce recall from 100% to 0% while retaining a 90% jailbreaking rate. Obfuscation has limits, though: on a complex task (writing SQL code), it reduces model performance.

Recent latent-space monitoring techniques have shown promise as defenses against LLM attacks. These defenses act as scanners that seek to detect harmful activations before they lead to undesirable actions. This prompts the question: Can models execute harmful behavior via inconspicuous latent states? Here, we study such obfuscated activations. We show that state-of-the-art latent-space defenses -- including sparse autoencoders, representation probing, and latent OOD detection -- are all vulnerable to obfuscated activations. For example, against probes trained to classify harmfulness, our attacks can often reduce recall from 100% to 0% while retaining a 90% jailbreaking rate. However, obfuscation has limits: we find that on a complex task (writing SQL code), obfuscation reduces model performance. Together, our results demonstrate that neural activations are highly malleable: we can reshape activation patterns in a variety of ways, often while preserving a network's behavior. This poses a fundamental challenge to latent-space defenses.
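To make the recall figure concrete, here is a minimal toy sketch, not the paper's method. It assumes a hand-set linear probe over hypothetical 2-D "activations" and shows how shifting each activation along the probe direction can drive recall from 1.0 to 0.0 while leaving the orthogonal component unchanged. That component stands in, purely by assumption, for whatever drives the model's behavior. All names (`probe_score`, `recall`, `obfuscate`) and all numbers are illustrative.

```python
# Toy illustration only: hypothetical 2-D "activations" and a hand-set linear probe.
# Nothing here reproduces the paper's models, probes, or attacks.

def probe_score(activation, weights, bias):
    """Linear probe: a higher score means 'more likely harmful'."""
    return sum(a * w for a, w in zip(activation, weights)) + bias


def recall(harmful_activations, weights, bias, threshold=0.0):
    """Fraction of truly harmful inputs that the probe flags."""
    flagged = sum(
        1 for act in harmful_activations
        if probe_score(act, weights, bias) > threshold
    )
    return flagged / len(harmful_activations)


def obfuscate(activation, weights, bias, margin=1.0):
    """Shift an activation along the probe direction until its score is -margin.

    Only the component along `weights` changes; the orthogonal component
    is left untouched.
    """
    norm_sq = sum(w * w for w in weights)
    shift = (probe_score(activation, weights, bias) + margin) / norm_sq
    return [a - shift * w for a, w in zip(activation, weights)]


weights, bias = [1.0, 0.0], 0.0  # hypothetical probe that reads only dimension 0
harmful = [[2.0, 1.0], [1.5, -0.5], [3.0, 0.25], [0.5, 2.0]]

recall_before = recall(harmful, weights, bias)
obfuscated = [obfuscate(act, weights, bias) for act in harmful]
recall_after = recall(obfuscated, weights, bias)

print(recall_before, recall_after)
```

In this toy setting the probe sees only one direction, so evading it is trivial. The abstract's claim is the stronger empirical one that real probes, sparse autoencoders, and OOD detectors can be evaded while harmful behavior is largely preserved.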
