CL AI · Mar 10

Finding and Reactivating Post-Trained LLMs' Hidden Safety Mechanisms

Mingjie Li, Wai Man Si, Michael Backes, Yang Zhang, Yisen Wang

arXiv:2604.00012 · 76.92 citations

Predicted impact top 78% in CL · last 90 days · Originality: Incremental advance

AI Analysis

This work addresses a critical safety issue for users of post-trained LLMs, such as reasoning or medical models, by providing a cost-effective way to mitigate harmful behaviors. It is incremental in that it builds on existing fine-tuning techniques.

The paper tackles safety degradation in post-trained large language models, focusing on large reasoning models (LRMs). It finds that post-training masks the base model's original safety mechanisms rather than removing them, and proposes SafeReAct, a lightweight method that restores safety behaviors using LoRA adapters on a few layers. Across four state-of-the-art LRMs, the method significantly improves safety on harmful prompts without compromising reasoning performance.

Despite the impressive performance of general-purpose large language models (LLMs), they often require fine-tuning or post-training to excel at specific tasks. For instance, large reasoning models (LRMs), such as the DeepSeek-R1 series, demonstrate strong reasoning capabilities after different general-purpose LLMs are post-trained on diverse chain-of-thought (CoT) datasets. However, this additional training frequently comes at the cost of reduced safety: fine-tuned or post-trained models tend to exhibit more harmful behaviors than the original LLMs did before training, and their enhanced capabilities can make the resulting outcomes more harmful. Taking LRMs as an example, we first investigate the underlying cause of this safety degradation. Our analysis reveals that post-training can mask the original safety mechanisms of the base LLM while over-amplifying representations related to the post-trained ability. We also find that the LRMs' safety mechanisms still exist; they are suppressed rather than removed during post-training. Based on these findings, we propose a lightweight and cost-effective solution called SafeReAct that restores the suppressed safety behaviors by aligning with LoRA adapters on a few layers. Experiments on four state-of-the-art LRMs show that our method significantly improves safety on harmful prompts without compromising reasoning performance. Beyond LRMs, additional results on other domain-specific LLMs, such as medical models, further confirm the generality and effectiveness of our approach.
