MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
This paper addresses the challenge of lightweight safety defenses for LLMs. Its approach is novel in avoiding both harmful training data and architectural modifications, though its improvement over existing classifier-based defenses appears incremental.
The paper tackles the problem of defending large language models (LLMs) against adversarial jailbreak attacks. It proposes MANATEE, an inference-time defense that uses density estimation and diffusion to project anomalous representations toward safe regions. The authors report reductions in Attack Success Rate of up to 100% on certain datasets while preserving utility on benign inputs.
Defending LLMs against adversarial jailbreak attacks remains an open challenge. Existing defenses rely on binary classifiers that fail when adversarial input falls outside the learned decision boundary, and repeated fine-tuning is computationally expensive while potentially degrading model capabilities. We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold. MANATEE learns the score function of benign hidden states and uses diffusion to project anomalous representations toward safe regions, requiring no harmful training data and no architectural modifications. Experiments across Mistral-7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-2-9B-it demonstrate that MANATEE reduces Attack Success Rate by up to 100% on certain datasets, while preserving model utility on benign inputs.
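The idea of scoring hidden states against a benign density and nudging outliers back toward it can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the score model, the noise schedule, or which layer is used. The sketch below assumes a diagonal Gaussian as a stand-in for the learned benign density, uses its analytic score in place of a trained score network, and replaces the diffusion sampler with plain deterministic score ascent. All function names, the 5% threshold, and the step size are hypothetical choices made for illustration.

```python
import numpy as np

def fit_gaussian(benign):
    """Fit a diagonal Gaussian to benign hidden states.
    Assumption: a stand-in for a learned density / score model."""
    mu = benign.mean(axis=0)
    var = benign.var(axis=0) + 1e-6  # small floor avoids division by zero
    return mu, var

def log_density(h, mu, var):
    """Log-density of h under the diagonal Gaussian (works for 1-D or 2-D h)."""
    return -0.5 * np.sum((h - mu) ** 2 / var + np.log(2 * np.pi * var), axis=-1)

def score(h, mu, var):
    """Score function: gradient of log p(h) with respect to h."""
    return -(h - mu) / var

def project(h, mu, var, threshold, step=0.1, max_steps=200):
    """Move h along the (variance-preconditioned) score until its
    log-density reaches the threshold. Inputs already above the
    threshold are returned unchanged."""
    h = h.copy()
    for _ in range(max_steps):
        if log_density(h, mu, var) >= threshold:
            break
        h = h + step * var * score(h, mu, var)
    return h

# Toy data: 16-dimensional "hidden states" drawn from a standard normal.
rng = np.random.default_rng(0)
benign = rng.normal(size=(2000, 16))
mu, var = fit_gaussian(benign)

# Hypothetical anomaly threshold: 5th percentile of benign log-densities.
threshold = np.quantile(log_density(benign, mu, var), 0.05)

anomalous = mu + 8.0  # far from the benign region in every dimension
projected = project(anomalous, mu, var, threshold)
```

Only benign samples are needed to fit the density and set the threshold, which mirrors the abstract's claim that no harmful training data is required. A representation that already clears the threshold passes through untouched, which is the toy analogue of preserving utility on benign inputs.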