LG | AI | CL | Feb 6, 2025

Safety Reasoning with Guidelines

arXiv:2502.04040v2 | 10 citations | h-index: 34 | ICML
AI Analysis

This paper addresses safety challenges in deploying LLMs, but its contribution is incremental because it builds on existing refusal training methods.

The paper tackles the problem of training safe LLMs that generalize against out-of-distribution jailbreaking attacks, showing that vanilla refusal training fails to elicit latent safety knowledge, and proposing a method that improves generalization significantly.

Training safe LLMs remains a critical challenge. The most widely used method, Refusal Training (RT), struggles to generalize against various Out-of-Distribution (OOD) jailbreaking attacks. Although various advanced methods have been proposed to address this issue, we instead question whether OOD attacks inherently surpass the capability of vanilla RT. Evaluations using Best-of-N (BoN) reveal significant safety improvements as N increases, indicating that models possess adequate latent safety knowledge but that RT fails to consistently elicit it under OOD scenarios. Further domain adaptation analysis reveals that direct RT causes reliance on superficial shortcuts, resulting in non-generalizable representation mappings. Inspired by our findings, we propose training models to perform safety reasoning for each query. Specifically, we synthesize reasoning supervision aligned with specified guidelines that reflect diverse perspectives on safety knowledge. This encourages the model to engage in deeper reasoning, explicitly eliciting and utilizing latent safety knowledge for each query. Extensive experiments show that our method significantly improves model generalization against OOD attacks.
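The Best-of-N evaluation mentioned in the abstract can be illustrated with a small sketch. It is not the paper's code; it assumes that an attack prompt counts as defended if at least one of N sampled responses is judged safe, and that the attack success rate is the fraction of prompts where all N samples are unsafe. The names `best_of_n_attack_success_rate`, `make_stub_model`, and `stub_is_safe` are hypothetical, and the stubs stand in for a real model and a real safety judge.

```python
from typing import Callable, Dict, List


def best_of_n_attack_success_rate(
    prompts: List[str],
    generate: Callable[[str], str],
    is_safe: Callable[[str, str], bool],
    n: int,
) -> float:
    """Fraction of prompts for which none of n sampled responses is judged safe.

    Assumption: a prompt is "defended" if any one of the n samples is safe.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not prompts:
        raise ValueError("prompts must not be empty")

    failures = 0
    for prompt in prompts:
        # Draw up to n samples; stop early once a safe response appears.
        defended = any(is_safe(prompt, generate(prompt)) for _ in range(n))
        if not defended:
            failures += 1
    return failures / len(prompts)


def make_stub_model(refuse_on_call: int) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: refuses on every k-th call per prompt."""
    calls: Dict[str, int] = {}

    def generate(prompt: str) -> str:
        calls[prompt] = calls.get(prompt, 0) + 1
        if calls[prompt] % refuse_on_call == 0:
            return "I can't help with that."
        return "Sure, here is how..."

    return generate


def stub_is_safe(prompt: str, response: str) -> bool:
    """Hypothetical stand-in for a safety judge: treats refusals as safe."""
    return response.startswith("I can't")


attack_prompts = ["jailbreak prompt A", "jailbreak prompt B"]
asr_n1 = best_of_n_attack_success_rate(
    attack_prompts, make_stub_model(3), stub_is_safe, n=1
)
asr_n3 = best_of_n_attack_success_rate(
    attack_prompts, make_stub_model(3), stub_is_safe, n=3
)
```

With this stub, which refuses only on the third sample for each prompt, `asr_n1` is 1.0 and `asr_n3` is 0.0. That mirrors, in a toy setting, the abstract's observation that safety improves as N grows when the model can produce a safe response but does not do so consistently.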

