LG · Dec 12, 2024

Obfuscated Activations Bypass LLM Latent-Space Defenses

arXiv:2412.09565v2 · 27 citations · h-index: 19
AI Analysis

This paper exposes a fundamental challenge for LLM security: current latent-space monitoring techniques can be evaded by attacks that obfuscate activations. The finding matters to developers and researchers working on AI safety.

The paper studies how vulnerable latent-space defenses in large language models (LLMs) are to obfuscated activations. It shows that, against probes trained to classify harmfulness, attacks can often reduce recall from 100% to 0% while retaining a 90% jailbreaking rate. Obfuscation has limits, however: on a complex task (writing SQL code), it reduces model performance.

Recent latent-space monitoring techniques have shown promise as defenses against LLM attacks. These defenses act as scanners that seek to detect harmful activations before they lead to undesirable actions. This prompts the question: Can models execute harmful behavior via inconspicuous latent states? Here, we study such obfuscated activations. We show that state-of-the-art latent-space defenses -- including sparse autoencoders, representation probing, and latent OOD detection -- are all vulnerable to obfuscated activations. For example, against probes trained to classify harmfulness, our attacks can often reduce recall from 100% to 0% while retaining a 90% jailbreaking rate. However, obfuscation has limits: we find that on a complex task (writing SQL code), obfuscation reduces model performance. Together, our results demonstrate that neural activations are highly malleable: we can reshape activation patterns in a variety of ways, often while preserving a network's behavior. This poses a fundamental challenge to latent-space defenses.
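To make the recall figures above concrete, here is a toy sketch of what it means for an activation to evade a linear harmfulness probe. It is not the paper's attack. The synthetic Gaussian "activations", the difference-of-means probe, and the helper names (`fit_mean_diff_probe`, `obfuscate`) are all assumptions introduced for illustration, and the toy does not model whether the network's behavior is preserved.

```python
import random

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_mean_diff_probe(benign, harmful):
    """Hypothetical difference-of-means probe: flag x when dot(w, x) + bias > 0."""
    mu_benign, mu_harmful = mean(benign), mean(harmful)
    w = [h - g for h, g in zip(mu_harmful, mu_benign)]
    midpoint = [(h + g) / 2 for h, g in zip(mu_harmful, mu_benign)]
    bias = -dot(w, midpoint)
    return w, bias

def recall(w, bias, harmful):
    """Fraction of harmful activations that the probe flags."""
    flagged = sum(1 for x in harmful if dot(w, x) + bias > 0)
    return flagged / len(harmful)

def obfuscate(x, w, bias, margin=1.0):
    """Shift x along -w until the probe score is -margin.

    Components of x orthogonal to w are left unchanged.
    """
    score = dot(w, x) + bias
    if score <= -margin:
        return list(x)
    step = (score + margin) / dot(w, w)
    return [xi - step * wi for xi, wi in zip(x, w)]

# Synthetic stand-ins for activations (assumption: the classes differ in one direction).
random.seed(0)
DIM = 8
benign = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(200)]
harmful = [[random.gauss(6 if i == 0 else 0, 1) for i in range(DIM)] for _ in range(200)]

w, bias = fit_mean_diff_probe(benign, harmful)
recall_before = recall(w, bias, harmful)

obfuscated = [obfuscate(x, w, bias) for x in harmful]
recall_after = recall(w, bias, obfuscated)

print(f"recall before obfuscation: {recall_before:.2f}")
print(f"recall after obfuscation:  {recall_after:.2f}")
```

In this toy, every shifted vector lands on the benign side of the probe's decision boundary, so the probe's recall on them drops to zero. The attacks described in the abstract face a harder problem: they must also keep the model's harmful behavior intact, which the abstract reports as a retained 90% jailbreaking rate.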

