SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection

Shuhao Chen, Weisen Jiang, Yeqi Gong, Shengda Luo, Chengxiang Zhuo, Zang Li, James T. Kwok, Yu Zhang

arXiv:2605.28030 · 90.93 citations · h-index: 13 · Has Code

Predicted impact: top 7% in LG · last 90 days · Originality: Incremental advance

AI Analysis

For LLM practitioners, SPARD provides a robust defense against adversarial fine-tuning that preserves safety without sacrificing utility.

SPARD defends against harmful fine-tuning attacks on LLMs by combining safety-projected alternating optimization with relevance-diversity data selection, achieving the lowest average attack success rates on GSM8K and OpenBookQA while maintaining high task accuracy.

Fine-tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine-tuning attacks in which adversarial data removes safeguards and induces unsafe behaviors. We propose SPARD, a defense framework that integrates Safety-Projected Alternating optimization with Relevance-Diversity aware data selection. SPARD employs SPAG, which alternates between utility updates and explicit safety projections that use a set of safe data to enforce safety constraints. To curate this safe data, we introduce a Relevance-Diversity Determinantal Point Process that selects a compact set of safe examples, balancing task relevance and safety coverage. Experiments on GSM8K and OpenBookQA under four harmful fine-tuning attacks demonstrate that SPARD consistently achieves the lowest average attack success rates, substantially outperforming state-of-the-art defense methods, while maintaining high task accuracy. Code is available at https://github.com/shuhao02/SPARD.
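To make the relevance-diversity selection idea concrete, the following is a minimal sketch of greedy selection under a determinantal point process. It is not the authors' implementation. It assumes the standard quality-diversity kernel L_ij = q_i * S_ij * q_j, where q_i is the cosine similarity between a safe example and a task embedding (relevance) and S_ij is the cosine similarity between two safe examples (diversity). The function names, the toy 2-D embeddings, and the greedy determinant-maximization loop are all illustrative assumptions; see the linked repository for the actual method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def build_kernel(safe_embs, task_emb):
    """Assumed kernel: L[i][j] = q_i * cos(e_i, e_j) * q_j."""
    # Relevance is clipped to stay positive so the kernel stays well behaved.
    q = [max(cosine(e, task_emb), 1e-6) for e in safe_embs]
    n = len(safe_embs)
    return [[q[i] * cosine(safe_embs[i], safe_embs[j]) * q[j]
             for j in range(n)] for i in range(n)]

def greedy_select(kernel, k):
    """Greedily add the item that maximizes det of the selected submatrix."""
    selected = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(len(kernel)):
            if i in selected:
                continue
            idx = selected + [i]
            sub = [[kernel[r][c] for c in idx] for r in idx]
            d = determinant(sub)
            if d > best_det:
                best, best_det = i, d
        if best is None:  # no remaining item adds positive volume
            break
        selected.append(best)
    return selected

# Toy data (hypothetical): items 0 and 1 are near-duplicates, item 2 differs.
safe_embs = [[1.0, 0.0], [0.98, 0.2], [0.0, 1.0]]
task_emb = [1.0, 0.5]
selected = greedy_select(build_kernel(safe_embs, task_emb), k=2)
print(selected)
```

In this toy run the most task-relevant item (index 1) is picked first. The second pick is the dissimilar item (index 2) rather than the near-duplicate (index 0), even though index 0 is more relevant to the task: the determinant penalizes redundancy, which is the relevance-diversity trade-off the abstract describes.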
