CL LG | Jan 19

Objective Matters: Fine-Tuning Objectives Shape Safety, Robustness, and Persona Drift

Daniel Vennemeyer, Punya Syon Pandey, Phan Anh Duong, Michael Umeokoli, Samuel Ratnam

arXiv:2601.12639v1 | 0.6

Originality: Incremental advance

AI Analysis

This work addresses how to maintain alignment and robustness in fine-tuned LLMs. For AI safety researchers and practitioners, it identifies objective selection as a key factor in mitigating safety degradation.

The study examines how different fine-tuning objectives affect the safety, robustness, and persona stability of large language models. At small training budgets, the choice of objective has little effect on safety. At larger budgets, supervised and preference-based tuning increase adversarial vulnerability and persona drift, while ORPO and KL-regularization substantially mitigate both.

Fine-tuning LLMs on benign data can still degrade alignment and adversarial robustness, yet direct analysis of the role of fine-tuning objectives in shaping these safety outcomes remains limited. We present a controlled comparison of six fine-tuning objectives -- Supervised Fine-Tuning, Direct Preference Optimization, Conditional Fine-Tuning, Inoculation Prompting, Odds Ratio Preference Optimization, and KL-regularized fine-tuning -- holding data, domain, architecture, and optimization fixed. Across closed-form reasoning and open-ended generation tasks, we find that objective choice induces systematic, scale-dependent shifts along the safety-capability frontier. At small training budgets, robustness is similar across objectives but capability differs. At larger budgets, objectives diverge sharply: supervised and preference-based tuning tightly couple capability gains to increased adversarial vulnerability and persona drift, while objectives that constrain learning signals -- especially ORPO and KL-regularization -- substantially mitigate both. Fine-tuning objectives therefore matter little for safety at small scales but become a primary driver of adversarial robustness and latent persona stability as training scale increases.
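
To make "objectives that constrain learning signals" concrete, here is a minimal sketch of the two objectives the abstract singles out, written for a toy single-token setting in plain Python. It is illustrative only and is not the authors' implementation. The function names, the `beta` and `lam` weights, and the direction of the KL term (policy relative to a frozen reference) are assumptions; the ORPO term follows the commonly published form, a negative log-likelihood plus a weighted log-odds-ratio penalty.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_regularized_loss(policy_logits, ref_logits, target_index, beta=0.1):
    """Negative log-likelihood of the target plus beta * KL(policy || reference).

    Assumption: the penalty is taken from the fine-tuned policy to a frozen
    reference model; the paper's exact formulation may differ.
    """
    p = softmax(policy_logits)
    q = softmax(ref_logits)
    nll = -math.log(p[target_index])
    return nll + beta * kl_divergence(p, q)

def orpo_loss(logp_chosen, logp_rejected, lam=0.1):
    """ORPO-style loss: NLL on the chosen response plus lam * odds-ratio term.

    Inputs are length-normalized log-probabilities and must be strictly
    negative (probability below 1), otherwise the odds are undefined.
    """
    def log_odds(logp):
        # log(p / (1 - p)) computed from log p
        return logp - math.log1p(-math.exp(logp))

    log_ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    # -log(sigmoid(x)) == log(1 + exp(-x))
    odds_ratio_term = math.log1p(math.exp(-log_ratio))
    return -logp_chosen + lam * odds_ratio_term
```

In both cases the extra term limits how far the update can move the model: the KL penalty vanishes when the policy matches the reference and grows as it drifts, and the odds-ratio term shrinks as the chosen response becomes more likely than the rejected one.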
