LG, CL · Oct 3, 2025

A Granular Study of Safety Pretraining under Model Abliteration

Shashank Agnihotri, Jonas Jakubassa, Priyam Dey, Sachin Goyal, Bernt Schiele, Venkatesh Babu Radhakrishnan, Margret Keuper

arXiv:2510.02768v1 · 13.04 citations · h-index: 137 · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the practical problem of ensuring AI safety in open-weight models, which matters to developers and researchers. The contribution is incremental because it builds on existing safety and editing techniques.

The study investigates whether common safety interventions in LLMs, such as refusal training, remain effective after model abliteration edits. It provides a checkpoint-level characterization of which safety components remain robust and quantifies how judge selection affects evaluation outcomes.

Open-weight LLMs can be modified at inference time with simple activation edits, which raises a practical question for safety: do common safety interventions like refusal training or metatag training survive such edits? We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions, and conduct a controlled evaluation across a granular sequence of Safety Pretraining checkpoints for SmolLM2-1.7B, alongside widely used open baselines. For each of 20 systems, original and abliterated, we issue 100 prompts with balanced harmful and harmless cases, classify responses as **Refusal** or **Non-Refusal** using multiple judges, and validate judge fidelity on a small human-labeled subset. We also probe whether models can identify refusal in their own outputs. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration, quantifies how judge selection influences evaluation outcomes, and outlines a practical protocol for integrating inference-time edits into safety assessments. Code: https://github.com/shashankskagnihotri/safety_pretraining.
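The abstract describes abliteration only as a projection that removes refusal-sensitive directions. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the refusal direction is estimated as the normalized difference between mean activations on harmful and harmless prompts, which is a common choice but is not stated in the text above. The function names and the toy activation values are hypothetical.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale v to unit length; raises if v is the zero vector."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def refusal_direction(harmful_acts, harmless_acts):
    """Assumed estimator: normalized difference of mean activations."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    return unit([a - b for a, b in zip(mu_harmful, mu_harmless)])

def ablate(h, direction):
    """Remove the component of h along the unit vector `direction`:
    h' = h - (h . r) r
    """
    coeff = sum(x * d for x, d in zip(h, direction))
    return [x - coeff * d for x, d in zip(h, direction)]

# Toy activations (made-up numbers, 3-dimensional for readability).
harmful_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

r_hat = refusal_direction(harmful_acts, harmless_acts)
edited = ablate([5.0, 2.0, 1.0], r_hat)
```

After the edit, the activation has no component along `r_hat`, while the components orthogonal to it are unchanged. In a real model, the same projection would be applied to hidden states (or folded into weight matrices) at the layers chosen for editing.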

