LG AI CL · Apr 9, 2025

Bridging the Gap Between Preference Alignment and Machine Unlearning

Xiaohua Feng, Yuyuan Li, Huwei Ji, Jiaming Zhang, Li Zhang, Tianyu Du, Chaochao Chen

arXiv:2504.06659v111.44 citations · h-index: 63

Originality: Incremental advance

AI Analysis

This work addresses the cost and instability of mainstream Preference Alignment (PA) methods for LLMs by offering a more efficient, unlearning-based approach. The advance appears incremental because it builds on existing unlearning techniques.

The paper tackles the challenge of improving Preference Alignment (PA) for Large Language Models by bridging it with machine unlearning. It proposes a framework that selects and weights negative examples for unlearning so as to maximize PA performance, and it reports experiments that confirm the framework's effectiveness.

Despite advances in Preference Alignment (PA) for Large Language Models (LLMs), mainstream methods like Reinforcement Learning from Human Feedback (RLHF) face notable challenges. These approaches require high-quality datasets of positive preference examples, which are costly to obtain, and they are computationally intensive because of training instability, which limits their use in low-resource scenarios. LLM unlearning presents a promising alternative by directly removing the influence of negative examples. However, current research has focused primarily on empirical validation and lacks systematic quantitative analysis. To bridge this gap, we propose a framework to explore the relationship between PA and LLM unlearning. Specifically, we introduce a bi-level optimization-based method to quantify the impact of unlearning specific negative examples on PA performance. Our analysis reveals that negative examples do not contribute equally to alignment improvement when unlearned: the effect varies significantly across examples. Building on this insight, we pose a crucial question: how can we optimally select and weight negative examples for unlearning to maximize PA performance? To answer this, we propose a framework called Unlearning to Align (U2A), which leverages bi-level optimization to efficiently select and unlearn examples for optimal PA performance. We validate the proposed method through extensive experiments, with results confirming its effectiveness.
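
The idea of scoring each negative example by its effect on PA performance, then weighting the examples for unlearning, can be illustrated with a toy sketch. This is not the authors' U2A implementation. It assumes a two-parameter logistic model, treats unlearning as a single gradient-ascent step on a negative example, and replaces the bi-level optimization with a first-order approximation (the inner product of the PA-loss gradient and the negative example's gradient). All function names, the data, and the step size `eta` are hypothetical.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_grad(theta, x, y):
    """Gradient of the logistic loss -log p(y | x) with respect to theta."""
    p = sigmoid(dot(theta, x))
    return [(p - y) * xi for xi in x]

def mean_loss(theta, data):
    """Average logistic loss; stands in for the PA loss in this toy setup."""
    total = 0.0
    for x, y in data:
        p = sigmoid(dot(theta, x))
        total += -math.log(p if y == 1 else 1.0 - p)
    return total / len(data)

def mean_grad(theta, data):
    grads = [loss_grad(theta, x, y) for x, y in data]
    return [sum(g[j] for g in grads) / len(grads) for j in range(len(theta))]

def unlearning_impact(theta, negatives, preference_data, eta=0.1):
    """First-order estimate of the PA-loss change caused by one
    gradient-ascent (unlearning) step on each negative example.
    A negative score means unlearning that example should help alignment."""
    g_pa = mean_grad(theta, preference_data)
    return [eta * dot(g_pa, loss_grad(theta, x, y)) for x, y in negatives]

def select_weights(scores):
    """Keep only examples whose unlearning is estimated to lower the PA loss,
    weighting each in proportion to its estimated benefit."""
    helpful = [max(0.0, -s) for s in scores]
    total = sum(helpful)
    return [h / total if total > 0 else 0.0 for h in helpful]

def unlearn_step(theta, negatives, weights, eta=0.1):
    """Weighted gradient ascent on the negative examples."""
    step = [0.0] * len(theta)
    for (x, y), w in zip(negatives, weights):
        g = loss_grad(theta, x, y)
        step = [s + w * gj for s, gj in zip(step, g)]
    return [t + eta * s for t, s in zip(theta, step)]

# Hypothetical toy data: the preference set defines the desired behavior.
theta = [0.0, 0.0]
preference_data = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
negatives = [
    ([1.0, 0.0], 0),  # contradicts the preference data
    ([1.0, 0.0], 1),  # agrees with the preference data
    ([0.0, 1.0], 1),  # contradicts the preference data
]

scores = unlearning_impact(theta, negatives, preference_data)
weights = select_weights(scores)
theta_after = unlearn_step(theta, negatives, weights)
pa_before = mean_loss(theta, preference_data)
pa_after = mean_loss(theta_after, preference_data)
```

In this toy setup the two negative examples that contradict the preference data receive negative scores and share the unlearning weight, while the example that agrees with the preference data receives zero weight, which mirrors the abstract's observation that negative examples do not contribute equally when unlearned.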
