LG | Jun 4, 2025

Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning

arXiv:2506.03850v2 | 14 citations | h-index: 8 | Has Code | ICML
Originality: Incremental advance
AI Analysis

This paper addresses the safety risks that harmful fine-tuning poses to LLMs and offers a targeted mitigation method. The advance is incremental because it builds on existing alignment techniques.

The paper tackles the problem of harmful fine-tuning breaking safety alignment in LLMs. It shows that certain subsets of alignment data are more prone to forgetting and proposes Vulnerability-Aware Alignment (VAA) to mitigate this. Across four fine-tuning tasks, VAA significantly reduces harmful scores while preserving task performance.

Harmful fine-tuning (HFT), performed directly on open-source LLMs or through Fine-tuning-as-a-Service, breaks safety alignment and poses significant threats. Existing methods aim to mitigate HFT risks by learning robust representation on alignment data or making harmful data unlearnable, but they treat each data sample equally, leaving data vulnerability patterns understudied. In this work, we reveal that certain subsets of alignment data are consistently more prone to forgetting during HFT across different fine-tuning tasks. Inspired by these findings, we propose Vulnerability-Aware Alignment (VAA), which estimates data vulnerability, partitions data into "vulnerable" and "invulnerable" groups, and encourages balanced learning using a group distributionally robust optimization (Group DRO) framework. Specifically, VAA learns an adversarial sampler that samples examples from the currently underperforming group and then applies group-dependent adversarial perturbations to the data during training, aiming to encourage a balanced learning process across groups. Experiments across four fine-tuning tasks demonstrate that VAA significantly reduces harmful scores while preserving downstream task performance, outperforming state-of-the-art baselines.
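The group-balancing idea in the abstract can be illustrated with a minimal sketch of a Group DRO style sampler update. This is not the authors' implementation: the function name `update_group_weights`, the step size, and the loss values are hypothetical, and the group-dependent adversarial perturbation step is omitted. The sketch assumes the common exponentiated-gradient update, in which groups with higher loss receive a larger sampling probability.

```python
import math
import random


def update_group_weights(weights, group_losses, step_size=0.5):
    """One exponentiated-gradient step: upweight groups with higher loss.

    `weights` is a probability vector over groups and `group_losses`
    holds the current average loss of each group (hypothetical values here).
    """
    scaled = [w * math.exp(step_size * loss)
              for w, loss in zip(weights, group_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Two groups, as in the abstract: index 0 = "vulnerable", 1 = "invulnerable".
group_names = ["vulnerable", "invulnerable"]
weights = [0.5, 0.5]

# Assumed per-group alignment losses; in practice these would be
# re-estimated from the model at every training step.
group_losses = [2.0, 0.5]

for _ in range(3):
    weights = update_group_weights(weights, group_losses)

# Draw the group for the next training batch according to the sampler.
random.seed(0)
next_group = random.choices(group_names, weights=weights, k=1)[0]
```

Because the "vulnerable" group has the higher assumed loss, its sampling weight grows with each update, so training batches are drawn from it more often until the two group losses even out.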

