LG · CL · Oct 3, 2025

A Granular Study of Safety Pretraining under Model Abliteration

arXiv:2510.02768v1 · 4 citations · h-index: 137 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses a practical problem for developers and researchers: keeping open-weight models safe. The contribution is incremental, since it builds on existing safety-training and model-editing techniques.

The study investigates whether common safety interventions in LLMs, such as refusal training, remain effective after model abliteration edits. It provides a checkpoint-level characterization of which safety components remain robust and quantifies how judge selection affects evaluation outcomes.

Open-weight LLMs can be modified at inference time with simple activation edits, which raises a practical question for safety: do common safety interventions like refusal training or metatag training survive such edits? We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions, and conduct a controlled evaluation across a granular sequence of Safety Pretraining checkpoints for SmolLM2-1.7B, alongside widely used open baselines. For each of 20 systems, original and abliterated, we issue 100 prompts with balanced harmful and harmless cases, classify responses as **Refusal** or **Non-Refusal** using multiple judges, and validate judge fidelity on a small human-labeled subset. We also probe whether models can identify refusal in their own outputs. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration, quantifies how judge selection influences evaluation outcomes, and outlines a practical protocol for integrating inference-time edits into safety assessments. Code: https://github.com/shashankskagnihotri/safety_pretraining.
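As a rough illustration of the kind of projection edit the abstract describes, the sketch below removes a single direction from a batch of activation vectors. It is not the authors' implementation: the difference-of-means estimate of the direction, the function names `refusal_direction` and `ablate`, and the synthetic activations are all assumptions made for this example.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Estimate a unit direction as the difference of mean activations.

    Assumption: difference-of-means is used here only for illustration;
    the abstract does not state how the direction is obtained.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0:
        raise ValueError("mean activations are identical; no direction to remove")
    return diff / norm


def ablate(acts, direction):
    """Project out `direction` from each activation: h' = h - (h . r) r."""
    r = direction / np.linalg.norm(direction)
    coeff = np.asarray(acts @ r)[..., None]  # component of each row along r
    return acts - coeff * r


# Synthetic stand-ins for hidden states (hypothetical data, not model outputs).
rng = np.random.default_rng(0)
harmful = rng.normal(size=(50, 8)) + 2.0 * np.eye(8)[0]
harmless = rng.normal(size=(50, 8))

r = refusal_direction(harmful, harmless)
edited = ablate(harmful, r)

# After the edit, no activation has a component along r (up to rounding).
print(np.abs(edited @ r).max())
```

Because the edit is an orthogonal projection, applying it twice gives the same result as applying it once, and components orthogonal to the removed direction are left unchanged.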

