CL, AI | Aug 14, 2024

Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions

arXiv:2408.07663v2 | 2 citations | h-index: 9 | Has Code
AI Analysis

This paper addresses a security issue for users of large language models, the generation of harmful content, and represents an incremental improvement over prior defenses.

The paper tackles jailbreak attacks on large language models by proposing Alignment-Enhanced Decoding (AED), which uses adaptive decoding to enhance safety alignment while maintaining helpfulness. Experiments across five models and four jailbreak attacks validate its effectiveness.

Large language models are susceptible to jailbreak attacks, which can result in the generation of harmful content. While prior defenses mitigate these risks by perturbing or inspecting inputs, they ignore competing objectives, the underlying cause of alignment failures. In this paper, we propose Alignment-Enhanced Decoding (AED), a novel defense that employs adaptive decoding to address the root causes of jailbreak issues. We first define the Competitive Index to quantify alignment failures and utilize feedback from self-evaluation to compute post-alignment logits. Then, AED adaptively combines the Competitive Index and post-alignment logits with the original logits to obtain harmless and helpful distributions. Consequently, our method enhances safety alignment while maintaining helpfulness. We conduct experiments across five models and four common jailbreaks, with the results validating the effectiveness of our approach. Code is available at https://github.com/GIGABaozi/AED.git.
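The abstract does not give the exact combination rule, so the following is only a minimal sketch of the general idea of token-level adaptive blending. It assumes that the Competitive Index is a scalar in [0, 1] and that the combination is a linear interpolation of logits. Both are illustrative assumptions, not the paper's formula, and the names `blend_logits`, `competitive_index`, and the toy vocabulary are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blend_logits(original, post_alignment, competitive_index):
    """Interpolate between original and post-alignment logits.

    ASSUMPTION: a linear blend weighted by a scalar index in [0, 1].
    The paper's actual rule may differ; this only illustrates the idea
    that a higher index shifts weight toward the post-alignment logits.
    """
    if not 0.0 <= competitive_index <= 1.0:
        raise ValueError("competitive_index must lie in [0, 1]")
    if len(original) != len(post_alignment):
        raise ValueError("logit vectors must have the same length")
    c = competitive_index
    return [(1.0 - c) * o + c * p for o, p in zip(original, post_alignment)]


# Hypothetical three-token vocabulary and logits for one decoding step.
vocab = ["Sure", "Sorry", "the"]
original = [2.0, 0.5, 0.0]        # the model leans toward complying
post_alignment = [0.0, 2.5, 0.0]  # self-evaluation leans toward refusing

# Index 0.0: no detected competition, so the original distribution is kept.
benign = softmax(blend_logits(original, post_alignment, 0.0))

# Index 0.9: strong competition, so the refusal token dominates.
risky = softmax(blend_logits(original, post_alignment, 0.9))

print(vocab[benign.index(max(benign))])
print(vocab[risky.index(max(risky))])
```

Under these assumptions, a step with a low index leaves the model's output unchanged, which is how helpfulness would be preserved on benign prompts, while a high index moves probability mass toward the post-alignment distribution.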

Code Implementations: 1 repo