LGApr 23, 2025

Safety Pretraining: Toward the Next Generation of Safe AI

Pratyush Maini, Sachin Goyal, Dylan Sam, Alex Robey, Yash Savani, Yiding Jiang, Andy Zou, Matt Fredrikson, Zacharcy C. Lipton, J. Zico Kolter

CMUStanford

arXiv:2504.16980v230.640 citationsh-index: 15

Originality Highly original

AI Analysis

This addresses safety risks for AI deployments in high-stakes settings, offering a foundational approach rather than incremental improvements.

The paper tackles the problem of harmful content generation in large language models by introducing a data-centric pretraining framework that integrates safety from the start, reducing attack success rates from 38.8% to 8.4% on safety benchmarks without degrading general performance.

As large language models (LLMs) are increasingly deployed in high-stakes settings, the risk of generating harmful or toxic content remains a central challenge. Post-hoc alignment methods are brittle: once unsafe patterns are learned during pretraining, they are hard to remove. In this work, we present a data-centric pretraining framework that builds safety into the model from the start. Our framework consists of four key steps: (i) Safety Filtering: building a safety classifier to classify webdata into safe and unsafe categories; (ii) Safety Rephrasing: we recontextualize unsafe webdata into safer narratives; (iii) Native Refusal: we develop RefuseWeb and Moral Education pretraining datasets that actively teach model to refuse on unsafe content and the moral reasoning behind it, and (iv) Harmfulness-Tag annotated pretraining: we flag unsafe content during pretraining using a special token, and use it to steer model away from unsafe generations at inference. Our safety-pretrained models reduce attack success rates from 38.8\% to 8.4\% on standard LLM safety benchmarks with no performance degradation on general tasks.

View on arXiv PDF

Similar