CL · Apr 13, 2025

SaRO: Enhancing LLM Safety through Reasoning-based Alignment

arXiv:2504.09420v1 · 21 citations · h-index: 26
Originality: Incremental advance
AI Analysis

This work addresses LLM safety issues that affect both users and developers, but it appears incremental because it builds on existing alignment techniques such as DPO.

The paper tackles under-generalization and over-alignment in LLM safety alignment by proposing SaRO, a framework that incorporates safety-policy-driven reasoning. In the authors' experiments, SaRO outperforms traditional alignment methods.

Current safety alignment techniques for large language models (LLMs) face two key challenges: (1) under-generalization, which leaves models vulnerable to novel jailbreak attacks, and (2) over-alignment, which leads to the excessive refusal of benign instructions. Our preliminary investigation reveals semantic overlap between jailbreak/harmful queries and normal prompts in embedding space, suggesting that more effective safety alignment requires a deeper semantic understanding. This motivates us to incorporate safety-policy-driven reasoning into the alignment process. To this end, we propose the Safety-oriented Reasoning Optimization Framework (SaRO), which consists of two stages: (1) Reasoning-style Warmup (RW) that enables LLMs to internalize long-chain reasoning through supervised fine-tuning, and (2) Safety-oriented Reasoning Process Optimization (SRPO) that promotes safety reflection via direct preference optimization (DPO). Extensive experiments demonstrate the superiority of SaRO over traditional alignment methods.
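
The SRPO stage above relies on direct preference optimization. As a rough illustration only, the following sketch computes the standard DPO loss for a single preference pair from summed sequence log-probabilities. It is not the authors' implementation: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example, and how SaRO constructs its preference pairs is not shown here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    All names and the default beta are illustrative assumptions.
    """
    # Implicit rewards: how much more likely the policy makes each
    # response relative to the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # written in a form that avoids overflow for large |margin|.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy and reference agree, and it decreases as the policy raises the likelihood of the preferred (for example, safer and better-reasoned) response relative to the rejected one.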

