CL · AI · CR · Jun 3, 2024

Decoupled Alignment for Robust Plug-and-Play Adaptation

arXiv:2406.01514v3 · 13 citations
Originality: Incremental advance
AI Analysis

The paper offers a low-resource safety enhancement method for AI developers working with large language models, though it appears incremental because it builds on existing knowledge distillation techniques.

The paper tackles the problem of aligning large language models for safety without requiring supervised fine-tuning or reinforcement learning from human feedback, achieving an average defense success rate improvement of approximately 14.41% (up to 51.39%) on harmful question datasets across 17 unaligned models.

We introduce a low-resource safety enhancement method for aligning large language models (LLMs) without the need for supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF). Our main idea is to exploit knowledge distillation to extract the alignment information from existing well-aligned LLMs and integrate it into unaligned LLMs in a plug-and-play fashion. Methodologically, we employ delta debugging to identify the critical components of knowledge necessary for effective distillation. On the harmful question dataset, our method significantly enhances the average defense success rate by approximately 14.41%, reaching as high as 51.39%, in 17 unaligned pre-trained LLMs, without compromising performance.
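The abstract names delta debugging as the way critical components are located, but this page gives no further detail. As a rough illustration only, the sketch below shows the generic ddmin reduction loop on a list of component names. The names (`block_0` and so on) and the `still_aligned` predicate are hypothetical stand-ins and are not taken from the paper; in a real setting the predicate would presumably involve transplanting the chosen components into an unaligned model and checking its refusal behavior.

```python
from typing import Callable, List, Sequence


def ddmin(items: Sequence[str], test: Callable[[List[str]], bool]) -> List[str]:
    """Generic delta debugging (ddmin): shrink `items` to a 1-minimal
    subset for which `test` still returns True."""
    current = list(items)
    if not test(current):
        raise ValueError("test must hold on the full set of items")
    n = 2  # current granularity: number of chunks to split into
    while len(current) >= 2:
        chunk = (len(current) + n - 1) // n  # ceiling division
        starts = range(0, len(current), chunk)
        reduced = False
        # Step 1: try each chunk on its own.
        for s in starts:
            subset = current[s:s + chunk]
            if test(subset):
                current, n, reduced = subset, 2, True
                break
        # Step 2: try removing each chunk (test the complement).
        if not reduced:
            for s in starts:
                complement = current[:s] + current[s + chunk:]
                if test(complement):
                    current, n, reduced = complement, max(n - 1, 2), True
                    break
        # Step 3: nothing worked, so refine the granularity or stop.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), n * 2)
    return current


# Hypothetical component names, purely for illustration.
components = [f"block_{i}" for i in range(8)]


def still_aligned(subset: List[str]) -> bool:
    # Toy predicate (an assumption): pretend alignment is preserved only
    # when both block_2 and block_5 are among the transplanted components.
    return "block_2" in subset and "block_5" in subset


critical = ddmin(components, still_aligned)
print(critical)
```

With this toy predicate the search narrows eight candidates down to the two that matter. The appeal of ddmin here is that it needs only a yes/no check per candidate subset, which fits a setting where each check is an expensive model evaluation.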

