AI · Mar 18

Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation

arXiv:2603.17368 · 79.2 · h-index: 7
Predicted impact: top 36% in AI · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses a critical safety issue for users of large reasoning models, but the advance is incremental because it builds on existing safety alignment techniques.

The paper tackles the degraded safety of large reasoning models (LRMs) that use chain-of-thought (CoT) generation. It proposes a safety alignment method that promotes safety decision-making before CoT generation begins, which the authors report yields substantial safety improvements while maintaining reasoning performance.

Large reasoning models (LRMs) have achieved remarkable performance via chain-of-thought (CoT), but recent studies have shown that these enhanced reasoning capabilities come at the expense of significantly degraded safety capabilities. In this paper, we reveal that LRMs' safety degradation occurs only after CoT is enabled, and this degradation is not observed when CoT is disabled. This observation motivates us to consider encouraging LRMs to make safety decisions before CoT generation. To this end, we propose a novel safety alignment method that promotes the safety decision-making of LRMs before starting CoT generation. Specifically, we first utilize a BERT-based classifier to extract safety decision signals from a safe model (e.g., a CoT-disabled LRM) and then integrate these signals into LRMs' safety alignment as auxiliary supervision. In this way, the safety gradients can be backpropagated to the LRMs' latent representations, effectively strengthening the LRMs' safety decision-making abilities against CoT generation. Extensive experiments demonstrate that our method substantially improves the safety capabilities of LRMs while effectively maintaining LRMs' general reasoning performance.
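
The auxiliary-supervision idea in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not specify the loss, the probed layer, or the weighting, so everything below is an assumption made for illustration. In particular, `aux_safety_loss`, `total_loss`, the linear probe `(w, b)`, the scalar `teacher_signal` (standing in for the safety decision signal that the BERT-based classifier extracts from the safe model), and the weight `0.5` are all hypothetical. The sketch only shows how a safety loss computed on a latent vector yields a nonzero gradient with respect to that vector.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def aux_safety_loss(h, w, b, teacher_signal):
    """Binary cross-entropy between a linear probe on latent h and a teacher signal.

    h              : latent representation (list of floats), hypothetical
    w, b           : linear probe parameters, hypothetical
    teacher_signal : value in [0, 1]; assumed here to mean "the safe model refuses"
    Returns (loss, gradient of the loss with respect to h).
    """
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    p = sigmoid(z)
    eps = 1e-12  # guards against log(0)
    loss = -(teacher_signal * math.log(p + eps)
             + (1.0 - teacher_signal) * math.log(1.0 - p + eps))
    # For sigmoid + BCE, d(loss)/dz = p - teacher_signal, and dz/dh = w.
    grad_h = [(p - teacher_signal) * wi for wi in w]
    return loss, grad_h


def total_loss(lm_loss, aux_loss, weight=0.5):
    """Alignment objective with the auxiliary term added (the weight is an assumption)."""
    return lm_loss + weight * aux_loss


if __name__ == "__main__":
    h = [0.2, -0.1, 0.4]   # toy latent vector
    w = [1.0, 0.5, -0.3]   # toy probe weights
    loss, grad_h = aux_safety_loss(h, w, 0.0, teacher_signal=1.0)
    # One gradient step on the latent itself, to show the signal reaches it.
    h_next = [hi - 0.5 * gi for hi, gi in zip(h, grad_h)]
    loss_next, _ = aux_safety_loss(h_next, w, 0.0, teacher_signal=1.0)
    print("aux loss before:", loss, "after:", loss_next)
    print("combined objective:", total_loss(lm_loss=2.0, aux_loss=loss))
```

In a real training setup the gradient would flow into the model parameters that produce the latent, not into the latent directly; updating `h` here is only a way to make the direction of the signal visible.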

