LG · AI · May 25

Curriculum Learning for Safety Alignment

arXiv:2605.26315 · 96.1 · Has Code
Predicted impact: top 3% in LG · last 90 days
Originality: Incremental advance
AI Analysis

For LLM safety practitioners, a method to enhance robustness of DPO alignment without sacrificing general performance.

Curriculum learning improves DPO-based safety alignment, reducing OOD harmful responses by 16% and jailbreak success by 20% while preserving general capabilities.

Direct Preference Optimisation (DPO) is widely used for safety alignment in large language models. However, prior work shows it is brittle and exhibits poor out-of-distribution (OOD) generalisation. In this paper, we investigate whether Curriculum Learning can improve the robustness of DPO-based safety alignment. We propose Staged-Competence, a curriculum-based framework that organises preference data by difficulty, employs competence-based sampling, and progressively updates the reference model during training. Averaged across three model families, Staged-Competence reduces OOD harmful response rates by 16% and jailbreak attack success rates by 20%, while preserving general capabilities with near-zero over-refusal. We further show that Staged-Competence (1) matches baseline safety with only 75% of the training data and (2) yields better separation between safe and unsafe responses. Staged-Competence is agnostic to the policy optimisation loss and can extend to other DPO variants and alignment domains. Our code and data are available at https://github.com/Sandeep5500/curriculum-learning-for-safety.
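
The three ingredients named in the abstract (difficulty ordering, competence-based sampling, and staged reference-model updates) can be illustrated with a minimal sketch. This is not the authors' implementation: the square-root competence schedule is borrowed from earlier competence-based curriculum work and is an assumption here, as are the function names `competence`, `sample_batch`, and `run_curriculum`, the `difficulty` scoring callable, and the fixed-length stage rule. The DPO update and the reference-model refresh are left as comments, since the abstract does not specify them.

```python
import math
import random


def competence(step, total_steps, c0=0.1):
    """Assumed square-root schedule: the fraction of the difficulty-sorted
    data that is available for sampling at a given training step."""
    progress = min(1.0, step / total_steps)
    return min(1.0, math.sqrt(progress * (1.0 - c0 ** 2) + c0 ** 2))


def sample_batch(sorted_pairs, step, total_steps, batch_size, rng, c0=0.1):
    """Sample uniformly (with replacement) from the easiest
    competence(step) fraction of the preference pairs."""
    c = competence(step, total_steps, c0)
    cutoff = max(1, math.ceil(c * len(sorted_pairs)))
    pool = sorted_pairs[:cutoff]
    return [rng.choice(pool) for _ in range(batch_size)]


def run_curriculum(pairs, difficulty, total_steps, num_stages,
                   batch_size=4, seed=0):
    """Walk through the curriculum and return the steps at which the
    reference model would be refreshed (hypothetical stage rule)."""
    rng = random.Random(seed)
    sorted_pairs = sorted(pairs, key=difficulty)  # easy -> hard
    stage_len = max(1, total_steps // num_stages)
    ref_updates = []
    for step in range(total_steps):
        if step > 0 and step % stage_len == 0:
            # Placeholder: copy the current policy into the reference model.
            ref_updates.append(step)
        batch = sample_batch(sorted_pairs, step, total_steps, batch_size, rng)
        # Placeholder: apply one preference-optimisation update on `batch`.
        _ = batch
    return ref_updates
```

With `total_steps=100` and `num_stages=4`, this sketch refreshes the reference at steps 25, 50, and 75, while the sampling pool grows from roughly the easiest 10% of pairs to the full dataset. How difficulty is scored and when the reference is actually updated in Staged-Competence should be taken from the paper and the linked repository.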

