AI · Oct 1, 2025

Safety Instincts: LLMs Learn to Trust Their Internal Compass for Self-Defense

arXiv:2510.01088v1 · 1 citation · h-index: 19
Originality: Highly original
AI Analysis

The paper addresses scalable AI safety for LLM developers by enabling more autonomous alignment without extensive human oversight, though it builds incrementally on existing reinforcement learning approaches.

The paper tackles the challenge of ensuring LLM safety without universal standards or reliable external validators. It finds that aligned models already possess internal safety beliefs and introduces Safety Instincts Reinforcement Learning (SIRL), which uses this internal confidence as a self-generated reward signal. SIRL maintains 89%+ Defense Success Rates against 20+ jailbreak methods while preserving performance on mathematics, coding, and conversation benchmarks.

Ensuring Large Language Model (LLM) safety remains challenging due to the absence of universal standards and reliable content validators, making it difficult to obtain effective training signals. We discover that aligned models already possess robust internal safety beliefs: they consistently produce high-confidence refusals to harmful requests while exhibiting high entropy when generating potentially dangerous content. This entropy gap reveals an untapped signal--models intrinsically "know" when to refuse. We introduce Safety Instincts Reinforcement Learning (SIRL), which transforms this internal confidence into a self-generated reward signal, eliminating dependence on external validators or human annotations. SIRL teaches models to trust their safety instincts by reinforcing low-entropy refusal behaviors. Evaluated on Llama and Qwen models, SIRL maintains 89%+ Defense Success Rates (DSRs) against 20+ jailbreak methods, from static prompts to adaptive attacks. Using only 15,000 unlabeled prompts, SIRL surpasses resource-intensive supervised methods while preserving performance on mathematics, coding, and conversation benchmarks. Our work demonstrates that effective alignment can emerge from within, paving the way for more autonomous and robust AI safety mechanisms that scale without extensive human oversight.
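The entropy gap described in the abstract can be illustrated with a small sketch. The abstract does not give SIRL's exact reward formula, so the code below is only an assumption for illustration: it treats the negative mean per-token entropy of a response as the reward, so that confident (low-entropy) outputs score higher. The function names and the toy logits are hypothetical and do not come from the paper.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_response_entropy(step_logits):
    """Average token entropy across all generation steps of one response."""
    return sum(token_entropy(l) for l in step_logits) / len(step_logits)

def entropy_reward(step_logits):
    """Assumed reward: lower mean entropy (higher confidence) gives a higher reward."""
    return -mean_response_entropy(step_logits)

# Toy logits over a 4-token vocabulary, two generation steps each (made-up values).
confident_refusal = [[8.0, 0.0, 0.0, 0.0], [7.5, 0.5, 0.0, 0.0]]
uncertain_output = [[1.0, 0.9, 1.1, 1.0], [0.5, 0.6, 0.4, 0.5]]

print(entropy_reward(confident_refusal) > entropy_reward(uncertain_output))
```

In a reinforcement learning loop, a score of this kind would be computed from the model's own logits for each sampled response, which is what would let training proceed without external validators or human labels.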
