LG | Sep 12, 2024

Alignment with Preference Optimization Is All You Need for LLM Safety

arXiv:2409.07772v1 | 5 citations | h-index: 12
Originality: Synthesis-oriented
AI Analysis

This work addresses safety concerns for large language models, but it is incremental as it applies existing alignment techniques to improve safety scores.

The researchers apply preference optimization methods to the Falcon 11B model to improve its safety. The global safety score rises from 57.64% to 99.90%, and average toxicity scores in adversarial settings fall from over 0.6 to less than 0.07, at the cost of reduced general capabilities, particularly in math.

We demonstrate that preference optimization methods can effectively enhance LLM safety. Applying various alignment techniques to the Falcon 11B model using safety datasets, we achieve a significant boost in global safety score (from $57.64\%$ to $99.90\%$) as measured by LlamaGuard 3 8B, competing with state-of-the-art models. On toxicity benchmarks, average scores in adversarial settings dropped from over $0.6$ to less than $0.07$. However, this safety improvement comes at the cost of reduced general capabilities, particularly in math, suggesting a trade-off. We identify noise contrastive alignment (Safe-NCA) as an optimal method for balancing safety and performance. Our study ultimately shows that alignment techniques can be sufficient for building safe and robust models.
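
The abstract does not spell out the objectives behind the alignment techniques it compares. As a rough illustration of what a preference optimization loss computes on a single (safe, unsafe) response pair, here is a minimal sketch of the widely used DPO loss. This is a generic example, not the paper's Safe-NCA objective; the function name, argument names, and the default `beta` are assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch, not Safe-NCA).

    Each argument is the summed log-probability of a full response:
    "chosen" is the preferred (safe) response, "rejected" the unsafe one.
    """
    # How much more the policy favors each response than the frozen reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(x)) = log(1 + exp(-x)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy matches the reference, the loss equals log 2. It drops below that value as the policy shifts probability toward the safe response and away from the unsafe one, relative to the reference model.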

