CY AI CL Aug 12, 2025

From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training

arXiv:2508.09224v141 citations, h-index: 11, Robotics
Originality: Incremental advance
AI Analysis

This paper addresses safety issues in AI assistants for dual-use domains such as biology and cybersecurity, and represents an incremental improvement over existing refusal-based methods.

The paper tackles the brittleness of binary refusal boundaries in large language models by proposing safe-completions, a safety-training approach that targets the safety of the output rather than a classification of user intent. Incorporated into GPT-5, the approach improves safety on dual-use prompts and increases helpfulness.

Large Language Models used in ChatGPT have traditionally been trained to learn a refusal boundary: depending on the user's intent, the model is taught to either fully comply or outright refuse. While this is a strong mitigation for explicitly malicious prompts, focusing safety training on refusals can lead to brittleness for prompts with obscured user intent. Binary refusal boundaries are especially ill-suited for dual-use cases (such as biology or cybersecurity), where a user request can be answered safely at a high level, but in some cases can lead to malicious uplift if sufficiently detailed or actionable. As an alternative, we propose safe-completions: a safety-training approach that centers on the safety of the assistant's output, rather than a binary classification of the user's intent. Safe-completions seek to maximize helpfulness within the safety policy's constraints. We incorporated this approach into GPT-5 and find that across both production comparisons and internally controlled experiments, safe-completion training improves safety (especially on dual-use prompts), reduces the severity of residual safety failures, and substantially increases model helpfulness.
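
The contrast drawn in the abstract can be pictured with a small sketch. This is an illustration only, not the paper's actual training objective: the `Candidate` type, the `safety` and `helpfulness` scores, the `safety_threshold` value, and both reward functions are hypothetical names and numbers introduced here. The first function scores a response by whether the comply-or-refuse decision matches the classified intent; the second ignores intent and ranks any output that stays within the policy by how helpful it is.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    safety: float       # hypothetical score in [0, 1]; 1.0 means fully within policy
    helpfulness: float  # hypothetical score in [0, 1]


def refusal_style_reward(intent_is_malicious: bool, refused: bool) -> float:
    """Binary scheme: reward refusing malicious intent and complying otherwise."""
    return 1.0 if refused == intent_is_malicious else 0.0


def output_centric_reward(c: Candidate, safety_threshold: float = 0.5) -> float:
    """Output-centric scheme: outputs below the (assumed) safety threshold get
    no reward; outputs at or above it are ranked by helpfulness."""
    if c.safety < safety_threshold:
        return 0.0
    return c.helpfulness


# Three made-up responses to the same dual-use prompt.
candidates = [
    Candidate("detailed, actionable answer", safety=0.1, helpfulness=0.9),
    Candidate("high-level answer", safety=0.9, helpfulness=0.6),
    Candidate("outright refusal", safety=1.0, helpfulness=0.1),
]

best = max(candidates, key=output_centric_reward)
print(best.text)
```

Under these made-up scores, the output-centric reward prefers the high-level answer over both the unsafe detailed answer and the outright refusal, which is the behaviour the abstract describes for dual-use prompts. The binary scheme has no way to express that middle option, since it only sees the intent label and the refuse-or-comply decision.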

